are really one technical implementation that both provides a methodology for creating containers and, implicitly, a definition of which kernel technologies get turned on when a container is run. Think about the docker run command: it's a methodology for creating a container, but implicit in running that command is the definition of what gets enabled in the kernel when that container runs. So you're really accepting two things at the same time. Let me dig in deeper; hopefully it will make more sense.

This concept of a container engine is new, a 2013-ish concept. It didn't really exist before; there were libraries to create containers, but not really the concept of an engine. I like to refer to Docker as the biggest proof of concept ever invented. It's a really good proof of concept, and it was probably the only way to sell us on actually using this whole thing: you had to see it all working together. You had to see the container images sitting in a registry, see that you could pull them down and run them with one command, and understand how to interact with the container engine that goes and pulls those images. But you were buying into a lot of stuff all at once, in one big proof of concept. You were in a kind of walled garden: you were handed an API, a command line, registry servers, prebuilt images, all of these things demoed together as one working unit.

Today, though, we need to break them down into separate pieces to do a lot of other things. A perfect example: a couple of days ago I was working with a customer on a Hadoop setup where they were trying to use a container executor with YARN as the scheduler. That's a scenario where you don't really want all the complexity of what I call the Docker proof of concept, where client-server interactions go to a daemon, which then talks to another daemon, to finally fire off processes (I'll get into that). In a nutshell, I'm going to break this all down so we can see all the moving pieces inside this giant proof of concept, and then I'm going to explain the drawing on the right, but not right now, because it would be too much.

The other thing is that there are alternatives. This was the slide I was working on because I realized I had left it out of my story; don't ask me how. Everybody's used Docker, but there are alternatives: CRI-O is a container engine that is used inside of Kubernetes, and Podman is a great container engine that is used outside of Kubernetes for firing up single-node containers and pods. Think of pods as multiple containers living in the same namespaces, on the same network, accessing the same storage, that kind of thing. This talk is going to dig into how all container engines generally work, what they all do, what they have in common: essentially the nuts and bolts of what's going on underneath.
Another assumption I want to tackle first: the container engine versus the container host. If you really think about it, the container host is the container engine, because there are all these things you have to consider together. Especially in the context of Kubernetes, I think of the container host as the unit, not the container engine. I don't really want to swap out the container engine, because it's literally like changing an engine after the car has shipped; there's a lot of engineering that goes into putting that together. You don't swap a Ford engine into a Chevy after it's been shipped; that doesn't make sense. I'll dig into why, and the last slide in particular highlights where it goes beyond just the container engine and into the kubelet: pieces of the kubelet actually talk to the kernel too, so there's a wider ecosystem of software talking to the kernel, not just the container engine. That's the last assumption I want to tackle.

Now a short commercial break. There's another talk tomorrow that goes deeper into Podman, by Urvashi and, Dan, who else is it? Sally O'Malley. I didn't look up exactly when it is, but check that out if you want to dig into the specifics of Podman as another container engine. And tomorrow I'll be digging into the container standards around what makes all this work: why we can have three different container engines and they're not guaranteed to work, because nothing in technology is guaranteed to work, but the reasons why this does work. Long story short, I won't dig deeply into that here because it's too much to cover in one talk, but I wanted to at least do a commercial. Oh, and by the way, my Twitter handle is at the bottom of every slide, so I'll shamelessly plug myself; you're welcome to follow me. All right, that was a bad joke. Sorry.

So this is the main drawing that we're going to tackle; we're going to walk through all these different components. Actually, let me ask: how many of you have used Podman? How many have used CRI-O? All right, this is good. I start here because I think most people have used Docker, and even though a lot of people have used it, they don't necessarily understand what's happening under the covers. What I show here is that it's not just Docker: Docker talks to containerd, containerd talks to runc, and runc talks to the kernel to fire up containers, and at some point all of these technologies get turned on in the kernel. So if you really think about it, it's the kernel, runc, Docker, the kubelet, this whole stack. If there's some feature you turn on when you're scheduling something in Kubernetes, say a privileged container, where you set the security context to privileged, that has to be supported from the Kubernetes API server to the kubelet, from the kubelet to the engine, from the engine to runc, and from runc to the kernel.
And whenever you make changes to an API and add a new command-line flag, for example, it has to be supported all the way down that stack. So really, we like to think about the whole host as the container engine, because if any one of these pieces doesn't support something, it pretty much doesn't work.

And then here's a simplified drawing: CRI-O. I asked if anybody had used CRI-O; well, in a Kubernetes environment you probably wouldn't really notice it, because all it does is get rid of a box. We get rid of containerd and Docker and merge them into a single container engine that understands a protocol called CRI, so that the kubelet can talk to the CRI-O daemon. The CRI-O daemon then calls runc, which talks to the kernel, and on you go. It simplifies the stack a bit. So if you haven't checked it out: how many of you are running Kubernetes? Not that many yet, all right, that's good. Those of you who raised your hands, go check out CRI-O.

Now I want to tackle something that I think has been lost in the ages: how many of you understand the difference between user space and kernel space? A decent amount. That makes me feel good about life; a lot of the time I ask this and one person raises their hand. One time there were about 300 people at a startup conference and one guy raised his hand. I asked what he did, and he said, "I teach operating systems." I said, all right, we're done here. And then I was whiteboarding stuff on the wall showing people.

In a nutshell, the container engine lives in user space, and anything it does, it needs to call into the kernel to make happen. Think of a system call as a special function that the kernel handles, as opposed to user-space code you wrote yourself. Sometimes you write your own functions, but if you do a file open, you didn't implement the file open: it goes to the VFS layer, and the VFS layer driver talks to XFS, and so on; you relied on somebody else to do all of that. And all the access controls in play there are handled by the kernel for you, so you don't have to do that stuff. That's all a system call is.

So now let's talk about how Linux processes are created. The two most common system calls you would use are fork and exec. If you've ever run a system command in Python or Perl or Ruby, you're essentially doing a fork or an exec there: you're firing off a subprocess that goes and does some work. I've done nasty stuff like that in Python, where I call out to bash and do all kinds of nasty stuff in bash because I'm lazy, and then it comes back and I get the results. Fork and exec are the two main ones. In this example, you type a command into bash, and bash forks and execs: if you run a ps command, it forks, execs into ps, and then returns. Everybody understands that's basically how you run a process in bash.
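To make that pattern concrete, here is a minimal Go sketch of the same fork-and-exec sequence a shell performs when you type ps. Under the covers, exec.Command forks (via clone on Linux) and then execs the binary; the specific command and flags are just illustrative.

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// What a shell does when you type "ps -ef": fork a child process,
	// exec the ps binary inside it, then wait for it to return.
	cmd := exec.Command("ps", "-ef")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```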
In that scenario, you turn on some pieces of technology: you have access to the TCP stack, you have access to the VFS layer and XFS, you can write files and read files. In bash you understand all of that. That's a regular process.

And here's what I would argue is the magical place where it becomes a container, one of the foundational technologies that really allowed the Docker thing to happen: the clone syscall. Clone is a special version of fork, and you pass it a bunch of extra flags, which honestly most people are not used to doing. Those flags are what everyone has probably heard of as namespaces: the UTS (hostname) namespace, the network namespace, the process ID namespace, all these virtualized data structures that get created when you fire off the subprocess. Essentially what you're doing, just like virtualization, is carving off a little piece of the kernel: a copy of, or a reference to, the actual global namespaces. Take the process ID: you can create a separate, virtualized process ID table, but it still points back to the global process ID table. So process ID 1 in the container might be process ID 527 outside the container. Even when these things are virtualized, there's still a real representation of them in the kernel. Hopefully that makes sense to everyone.

So here's what it looks like. With a regular execve call, a process ID gets added to the global namespace, the process uses whatever UIDs and GIDs are in /etc/passwd, and the network is the same as the host's. We're all used to running, say, Apache on a host like that; it uses the global namespaces for all this stuff. That's running a process the normal way. Then, if you run it in a namespace, so instead of catting /etc/hosts on the host you do it inside a Docker container, you're doing it in a virtualized place. The process IDs are virtualized, the UIDs and GIDs can be virtualized, the network can be virtualized, although it doesn't have to be; each of these is optional, because these are optional flags that get passed to the clone syscall. This is the first step in the definition of what a container is.

Then I think it's useful to look at this: two different containerized processes running in two different sets of namespaces. It's important to notice that when you do something like a mount namespace, it's still relying on the drivers in the kernel: the virtual file system layer, the XFS driver, the block driver that reaches out to, say, an iSCSI or Fibre Channel volume. Those are not virtualized; that's shared code. As soon as you do a file open inside the container, it's still relying on all those underlying subsystems in the kernel to do the standard work they do, and those are not namespaced.
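Here is a minimal sketch of that first step, in Go (the language most of these engines happen to be written in): asking the kernel for new UTS, PID, and mount namespaces when launching a shell. This is not what any particular engine does verbatim, just the clone-flags idea; it is Linux-only and needs root (or user namespaces) to actually succeed.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Launch a shell in new UTS, PID, and mount namespaces. These flags
	// end up on the clone(2) call the kernel sees; roughly the "step one"
	// of a container described above.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Inside that shell, `echo $$` prints PID 1 even though the same process has a perfectly ordinary, much larger PID on the host, which is exactly the virtualized-but-still-real point made above.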
And it's important to understand that this is not virtualization. With full virtualization you have a separate running kernel in each virtual machine, and each has its own copies of all of that stuff, so between VMs it really is fully separate; with containers it's shared. Hopefully that gives you an aha moment. I see people nodding, so that's good.

So that's step one, the clone syscall. I started at the kernel and I'm building up, so now let's go up a layer and talk about the container runtime. Again, I won't dig deep here, but the Open Containers Initiative defines what a container runtime is, and it's an open standard. There's a reference implementation called runc, which is the most common container runtime: Docker uses it, CRI-O uses it, Podman uses it, Buildah uses it. Most of the things on the planet use it, because it's an extensible, well-supported, community-driven open source project managed by the Open Containers Initiative, which is part of the Linux Foundation (not the CNCF, I shouldn't say that). Other people implement their own OCI-compatible runtimes, which we'll dig into a little toward the end if I have time: things like Kata Containers or gVisor.

In a nutshell, what a container runtime expects is a file system mounted in a directory, a rootfs, so if you were to cd into that directory it should have /etc and /usr and all the things you'd see if you SSH'd into a server, plus a config.json. I still have this wrong in the drawing: it's not manifest.json, it's config.json; I fixed it in another drawing and forgot this one. If you tease apart that config.json, it looks very similar to the command-line options you would pass to Docker: it's got a command and an entry point and a whole bunch of stuff, but also things you might not see, like seccomp rules and SELinux settings. There are a lot of defaults in the container engine that get stuffed into that config.json and passed on to runc, which we'll dig into in a little while.

Notice in this drawing that the namespaces have been turned on; that was the clone syscall. But now we're also turning on things like SELinux, cgroups, seccomp, and capabilities. We're starting to see more of the kernel technologies get turned on by this runtime, and what we're doing is standardizing the way we talk to the kernel to turn them on. So now we're really seeing the formation of a user-space definition of what a container is, and runc helps us do that based on the OCI standard. That entire set of config options you can pass in config.json is pretty much the definition of what a container is; when we say the word "container" colloquially, that's essentially what we're referring to.
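To give a feel for the shape of that bundle, here is a rough Go sketch that emits a heavily trimmed-down config.json next to a rootfs directory. The field names follow the OCI runtime spec as best I can recall them; treat this as an illustration of the idea, not an authoritative schema, since a real engine fills in far more (seccomp profiles, SELinux labels, capabilities, mounts, cgroups, and so on).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A heavily simplified subset of the OCI runtime spec's config.json.
type Spec struct {
	OCIVersion string `json:"ociVersion"`
	Root       struct {
		Path     string `json:"path"`
		Readonly bool   `json:"readonly"`
	} `json:"root"`
	Process struct {
		Args []string `json:"args"`
		Env  []string `json:"env"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Hostname string `json:"hostname"`
	Linux    struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
	} `json:"linux"`
}

func main() {
	var spec Spec
	spec.OCIVersion = "1.0.2"
	spec.Root.Path = "rootfs" // the directory with /etc, /usr, etc.
	spec.Process.Args = []string{"/bin/sh"}
	spec.Process.Env = []string{"PATH=/usr/bin:/bin"}
	spec.Process.Cwd = "/"
	spec.Hostname = "demo"
	for _, ns := range []string{"pid", "mount", "uts", "network"} {
		spec.Linux.Namespaces = append(spec.Linux.Namespaces,
			struct {
				Type string `json:"type"`
			}{Type: ns})
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out)) // something shaped like this, plus a rootfs/, is what runc consumes
}
```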
So then it looks like this: it's more than just the clone syscall, which I show in the gray box. We also turn on cgroups, SELinux, seccomp, and all these other technologies, and then we start to say, okay, that's really a container. That bottom thing is what I think of when I think of a container. With the normal execve I showed you earlier, none of it is contained; with this one, the process is contained by those namespaces, and we also turn on cgroups to apply resource constraints, CPU and memory limits, things like that, to prevent noisy-neighbor problems. Then seccomp and sVirt are what I'd call more discrete controls that are mandatory in nature. Think of seccomp as a firewall for syscalls: you can block certain syscalls or whitelist others. And sVirt is a way to dynamically generate SELinux labels (I have a whole talk that goes deep into that), but think of it as preventing data structures in the kernel from talking to each other: you can say this process can access these files and this socket, and that's it. Each container gets its own dynamically generated label, and it can only talk to other data structures with that same label. So now you're seeing much more powerful isolation form in this definition of a container, beyond just the clone syscall.
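Of those technologies, cgroups are probably the easiest to demystify: to a runtime they are just files. Below is a rough Go sketch of what "put this process in a memory-limited cgroup" amounts to. It assumes cgroup v2 mounted at /sys/fs/cgroup, a made-up group name, and root privileges, so treat it as an illustration of the mechanism rather than what any engine literally does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Creating a cgroup is just creating a directory in the cgroup
	// filesystem; limits and membership are plain file writes.
	cg := "/sys/fs/cgroup/demo-container" // hypothetical group name
	if err := os.Mkdir(cg, 0o755); err != nil {
		panic(err)
	}
	// Cap memory at 256 MiB for everything placed in this group.
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("268435456"), 0o644); err != nil {
		panic(err)
	}
	// Move the current process into the group.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
		panic(err)
	}
	fmt.Println("process", pid, "now lives under", cg)
}
```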
And the Docker definition of a container wasn't the beginning of this, either. There was libvirt, there's LXC, systemd-nspawn, libcontainer: all of these have their own definitions and their own user-space command-line or library options that determine which pieces of technology get turned on. They've all standardized on similar things by now; you'll see SELinux and cgroups and seccomp as pretty common technologies across all of them. But the permutation of which ones you use is not standard. Does that make sense to everyone?

All right, so now we build up to the container engine level. This is where the meat of it is. What does the engine do? In a nutshell, the container engine provides an API, it pulls container images, and it prepares the configuration to pass to the runtime, which, as I said, expects a directory with a full file system in it and a config.json. You basically call runc with those two things and it goes and fires off the container the way it should. So the container engine is responsible for pulling the container image down, decomposing it, pulling the pieces and parts out of it (which I'll dig into deeper), creating that rootfs, and handing it off to runc. Does that make sense? Every container engine has to do this, whether it's Podman, CRI-O, or Docker. Even Buildah, because it has to fire up a container to then add stuff to it. Anything that builds or runs containers basically has to do this piece.

So what does this look like in action? First, providing an API. In the Docker world that means dockerd: dockerd is really what provides the API. If you connect to the Docker socket and drive it programmatically, not using the Docker CLI but say from Python, you're using that API directly. The command line also talks to that socket to reach the same API. And there's something called a CRI shim that talks to that API inside of Kubernetes: in the most common way Kubernetes is set up today, the kubelet talks to the Docker shim using this protocol called CRI, the Docker shim translates between CRI and the Docker API, dockerd talks to containerd, and containerd fires off copies of runc, passing them that config.json and those directories, to go start containers. That's what I'm trying to show here: what a system looks like in real life. This would be a container node inside your Kubernetes or OpenShift environment.
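Just to show how thin that API layer is, here is a small Go sketch that talks to the engine's socket directly, the same thing the CLI or a CRI shim does on your behalf. It assumes the daemon is listening on the default /var/run/docker.sock and that /containers/json is the container-list endpoint; you need permission on the socket for it to work.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	// Dial the engine's UNIX socket ourselves instead of using the CLI.
	tr := &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", "/var/run/docker.sock")
		},
	}
	client := &http.Client{Transport: tr}

	// List running containers; the host part of the URL is ignored
	// because we already dialed the socket above.
	resp, err := client.Get("http://localhost/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```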
The next thing the engine is supposed to do, beyond providing an API, is pulling and caching images and preparing them to hand off to runc. Pulling and caching images often uses root-level permissions. Think about the file system operations: mapping image layers to overlay layers or to device mapper layers. People don't realize that when a container image gets pulled down, it typically gets mapped right into the file system. Again, this is one of the world's best proofs of concept: all of this is hidden from you. You didn't realize it, but it was doing all kinds of funky stuff under the covers, root-level operations to map those container images into the file system, cache them locally, and prepare them so they're ready to run later. That's what was happening when you did a docker pull, and most people don't realize it; they think it's just a file living in the file system, but it's not that simple. That actually hung me up a while back; I didn't realize it was happening. So here I show the graph drivers getting turned on, overlay for example, so you can see that this file system caching operation is actually using things in the kernel to decompose and map those container image layers to file system layers. And then there's preparing the storage for runtime, the second phase: once we've cached the container image locally and it's mapped to those file system layers, be they device mapper or overlay, through this thing called a graph driver, what happens next?

Well, we also have to prepare the copy-on-write layer, which I show in this example on the left with MySQL, because it's something everybody should understand. If you run a container in read-only mode, the COW layer I show there will not be there. If you don't run it in read-only mode, which most people don't, that COW layer will be there. The COW layer is what handles it when you, say, echo "hello world" into /etc/hello: it doesn't fail in a container. All of you have gotten into Docker, played around in a shell, written files; it seems normal. But really what's happening is that you're writing into that COW layer. When you docker kill, that COW layer just sits there on disk; when you do a docker rm, it deletes that COW layer; and if you do a docker commit, it commits that COW layer as another layer on the image and maps it into the file system, so it becomes part of the container image and you can ship it back off if you want. A lot of people don't realize that's happening.

I have an anecdote I love about this one. A guy came up to me at a conference and said, we're building Yocto Linux in a Docker container and it's super slow; why? I asked, are you bind mounting the data through to the host? They said, no, we're just building in the container. Well, if you're compiling a Linux distribution... how many people here have compiled a kernel, or at least still compile kernels? Raise your hand. All right, enough of you have done it that you understand there are a ton of file system operations when you compile a Linux kernel, all kinds of metadata changes and things like that. It's a slow operation, and building a whole distro is even more so. They were doing it in the COW layer, so every single one of those file system operations was writing new data through the COW layer. Of course it was slow. I told them: bind mount it, go straight through the VFS layer to the real file system, because that'll be faster than doing it in the COW layer, even though it's the same local disk. Long story short, they did, and it was faster. But if you don't understand that this is happening, because again it's a black box, you won't know what to do; you'll just do it wrong, assuming it magically handles it for you. It doesn't, until you pass a volume to the container engine to tell it, hey, bind mount this thing externally: put /var/lib/mysql on an external volume, put /var/yocto or wherever the heck that build root lives (I don't even know where it builds) outside the container. Because it's beyond just building a kernel; it's building a whole distro: laying out the file system, setting all the permissions, doing all the things that, if you've ever run Gentoo, you have a feel for. Building a Linux distro does a lot of things, and you don't want to do that in a COW layer, because it's going to be super slow. All right, so now that I've ranted about that, hopefully all of you will use bind mounts from now on.
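For the curious, the "volume" the engine sets up for you is, at the bottom, just a bind mount. Here is a minimal Go sketch of that single syscall; the paths are made up, it needs root, and a real engine does this inside the container's mount namespace rather than on the host like this.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Roughly what -v /data/mysql:/var/lib/mysql asks the engine to do:
	// a bind mount, so writes land on the real filesystem instead of the
	// container's copy-on-write layer.
	src := "/data/mysql"     // hypothetical host directory
	dst := "/var/lib/mysql"  // where the container expects its data
	if err := syscall.Mount(src, dst, "", syscall.MS_BIND, ""); err != nil {
		panic(err)
	}
	fmt.Println("bind mounted", src, "onto", dst)
}
```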
And the other thing I'd say is: just run it with --read-only, and then you can never have this problem. If it fails, you'll know you need a bind mount.

All right, so now, digging quickly into the container engine itself. It takes CLI options from a user, often through that API; it combines those with defaults that are set in the container image; it adds the defaults that are in the container engine; and then it creates that config.json. It also creates a rootfs from all of the layers in the container image, plus a COW layer on top, and then it fires up the container. So the directory that gets passed to runc has a COW layer on top so that you can write into it; it seems writable even though the image layers aren't. And the config is a conglomeration of the defaults that come in the image, overridden by defaults in the container engine, and finally overridden by the user's options. For example, the default when you run a container in Docker is neither --privileged nor --read-only. If you, as the user, pass --privileged and --read-only, which I think you can do (I've never actually tried it), that overrides a whole ton of things that are in the container image and in the container engine: it would disable SELinux, it would disable the network namespace, it would disable all these things. So the user has a lot of control over how that config.json is going to get built and then run. Does that make sense to everyone? Because this is the money slide.

And then I make it more complex, because as I was explaining this to people I realized I had left out CNI, and CNI does a similar thing for the network. There's a CNI config blob that gets generated, and it's actually quite similar: which network you connect to, which ports you map, things like that. There are some defaults that come in the container image, some that come default in the container engine, and some that are overridden by the user. Networking is really very similar, except that we pass it off to CNI to do it. There are these binaries called CNI plugins, and those plugins expect environment variables and a config blob to be passed to them; they then know how to configure the network, again by talking to the Linux kernel, inside that network namespace. It's a separate program that does it; runc doesn't do that piece. So there are really a couple of different binaries working together to create a container. Does that make sense? Because this one's pretty much the full money slide.
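Here is a toy Go sketch of that precedence, just to make the layering concrete. The option names are invented for illustration, and the real config.json and CNI blobs are far richer, but the ordering (image defaults, then engine defaults, then user flags) is the idea.

```go
package main

import "fmt"

// merge applies later layers over earlier ones: the last writer wins.
func merge(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	imageDefaults := map[string]string{"cmd": "httpd", "readonly": "false"}   // baked into the image
	engineDefaults := map[string]string{"selinux": "enabled", "readonly": "false"} // engine policy
	userFlags := map[string]string{"readonly": "true"}                        // e.g. --read-only on the CLI

	final := merge(imageDefaults, engineDefaults, userFlags)
	fmt.Println(final) // roughly what ends up serialized into config.json and the CNI config
}
```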
And then there's one final piece, which I show as a separate thing in this last drawing: see how I turned on iptables, the final frontier. I do a demo here where, in Kubernetes, if you scale from one pod to ten pods, you'll see it go from about 13 iptables rules to about 40. Scale it up to a hundred and it'll be something like 400, and scale it back down to one and it goes back to 13. That is a ton of beating on the Linux kernel, adding and removing iptables rules. And if you've ever had real-world experience with this, if you run a production node for a while, you will see nasty stuff happen to the iptables config, because it just gets beat on really badly. This is my final argument for treating the entire host as the node in an orchestrated environment, because even the kubelet beats on the kernel directly for the service layer. Is everyone familiar with the service layer in Kubernetes? A service is essentially a way to give a name to a set of pods; it gives you network connectivity to the pods, and the way it does that is by adding iptables rules locally. It feels like magic: you just access the name and it works, and you don't know whether it's doing DNS or what it's doing. But it's actually doing a ton of iptables redirects, magic in iptables, so that if you have ten pods you can reach all ten by one name and they get round-robin load balanced. The way it does that is by going and beating on iptables to add a bunch of rules quickly, and when you scale down, beating on iptables again to delete all those rules. People don't realize that, and that's why I emphasize that the kernel is a really important piece of what you should think of as the container host.
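If you want to see that rule churn for yourself, a quick-and-dirty way is to dump the ruleset and count the service chains while you scale a deployment up and down. A sketch in Go follows; it assumes the node uses the iptables proxy backend and that kube-proxy's usual KUBE-SVC-* and KUBE-SEP-* chain-name prefixes apply, which varies by version and proxy mode.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Dump all iptables rules; needs root and the iptables backend.
	out, err := exec.Command("iptables-save").Output()
	if err != nil {
		panic(err)
	}
	svc, sep := 0, 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "KUBE-SVC-") {
			svc++ // per-service chains/rules
		}
		if strings.Contains(line, "KUBE-SEP-") {
			sep++ // per-endpoint (pod) chains/rules
		}
	}
	fmt.Printf("service rules: %d, endpoint rules: %d\n", svc, sep)
}
```

Run it before and after a scale-up and the endpoint count should move roughly with the pod count, which is the "beating on iptables" described above.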
All right. I don't know that I have time to do justice to the bonus information, because we're at 31 minutes, so I think I'll break for questions and hold off on this, and people can approach me afterward if they want. But these were the three things I was going to go into as bonus material. I always like to break out my joke that KubeVirt is not really one of these things, but I know your brain wants to know about it, so that's why I added it. I dig into each of these and show what they are too, but there's no way I can do that in three minutes, because they're even more complex. So with that, I'll break for questions. Any questions on all this? I know it was a lot. Was it a lot? Boom, we have a question. Good question, I'll repeat it. The question is: does the spec that defines runc also define the binary format of the containers themselves? The answer is no, and I'll dig into it deeper in the talk tomorrow, which I encourage you to come to. There are three specs: a runtime spec, a distribution spec, and an image spec, and they're all part of OCI. The image spec defines the binary format, which is not magical. It's just a bunch of tarballs and a manifest.json (which I screwed up in that drawing). The manifest points to the tarballs, which ones are for which architectures, and then there's a list of tarballs, essentially, sitting on disk in a registry server, for example.

Audience: what about the format of the executable? Okay, let me make sure I understand the question: if the command in the manifest points into the image, is the format of that binary defined, is ELF part of this? I'll answer it this way. If you are running a Linux container on Linux, all of those binaries are ELF binaries, and they're not defined by the OCI runtime spec; that's defined by Linux. If you're running Red Hat Enterprise Linux, we compile our binaries a certain way, and those binaries are typically dynamically linked, for example. That's something I go into deeply in my Linux container internals talk, because most people don't get it: when you fire up a container, if ld.so is what the binary is linked against, it will go find all the other libraries on disk, load them dynamically into the memory space, and then start the program inside that container. All the normal process rules apply, and that's beyond the scope of runc. Runc is just defining the different options that can be passed down from a container engine, and the summation of which technologies get turned on in the kernel to confine that binary. Beyond that, you're on your own. So Windows containers have to run on Windows, and Linux containers, I generally just say, have to run on Linux. There is obviously a syscall layer written for Windows that can theoretically run some Linux containers, but I would argue: run the matching distribution. If you have an Ubuntu server, run Ubuntu containers; if you have a RHEL server, run RHEL containers; if you have Windows, run Windows containers, and schedule them as such. If you start breaking those boundaries, I guarantee there will be pain. My old sysadmin gene twitches, and I get PTSD from being paged at two in the morning because somebody's running a compiled CentOS container on Windows and some weird syscall wasn't implemented and it breaks in some weird way. It's a Turing-complete problem, so I argue: just don't do it. All right, we're at time. Can we thank our speaker?
So, shall we just start on time? Yeah. Those who are late will have to suffer. So welcome to what I claim is the nerdiest talk of the conference; I claim that title. If you don't want nerdy technical details that are probably irrelevant for your future life, you might better leave now, or you'll waste 40 minutes of it. This is something of interest to many people inside Red Hat, but not really too many outside, or in other lines of work. I'm talking about the really intricate details of how a CPU actually executes things, and I will use a lot of references to old technology, because it's just simpler; we'll see this over time. When I talk to people, even people coming from a CS background, most of them don't have any real understanding of how a CPU actually works or what it operates on. They think about it at a very high level.

So let's start with something very simple. To perform the operations a CPU is normally supposed to do, we have, at the very core and from the very early days, something called the ALU, the arithmetic and logic unit. In the earliest days, the entire computer essentially consisted only of that: there were paper tapes and punch cards and so on fed into the machine, everything was simply read from those, run through the arithmetic and logic unit, and then there was an immediate output of some sort. That's the very core, and however curious it sounds, the only reason CPUs exist is to perform something according to the wishes of the user, of the programmer.

But to do this, we need what is called state. In today's more complex machines, we don't want data streaming in and streaming out without any dependence between the different instructions or operations we're performing; we need to keep track of information. One of the most obvious pieces is what's called an instruction pointer. Even the simplest model, if you think back to a Turing machine in the theoretical sense, had the position on the tape from which it was reading the next operation. We need that kind of information, and we still have it to this day. But if we want to influence this, the CPU itself needs to be able to operate even on something like the instruction pointer. So as a core part of the CPU we need instructions that operate on the way the CPU will execute in the future: execution control instructions inside the CPU itself.

And we are no longer at the point where instructions are hard-wired into the machine. In either the Harvard or the von Neumann model, they now live in memory. So we need to be able to read instructions from memory, and we keep state and data in memory. That's the part which is easy to understand; everyone knows about this. But for the CPU itself, it means we need access from inside the CPU to memory: load and store. And this is not something that comes for free.
Even in today's machines, the memory itself is normally not part of the CPU; it is somehow attached to it, in a non-trivial way, with lots of implications for how the CPU operates. And when we're loading things, it became clear very early on that we need some way of holding what we're loading. We cannot just say: load from there, do the operation, and store it somewhere else. It turns out that is far too slow; we'll get to that later. So very early on, computers introduced the concept of registers. These are simply specific, high-performance locations close to the CPU core where we can store data for a short period of time for certain operations. We have this infrastructure in the CPU itself, but to be able to utilize it, we need more operations: load and store operations, and transfer operations to move things between registers, not just between memory and registers but between the registers themselves.

And the whole thing nowadays has to run an operating system, which adds more requirements. We need operations that allow us to administer the running programs themselves. In the early days, a single program occupied a single machine at any one time; that became really, really inefficient over time. So now, in most systems, there is a clear distinction, with a system mode of the CPU in which we can arrange for multiple programs to run on the same machine at the same time as if each were running on its own. The programs themselves don't necessarily have to know about this, and that needs some form of abstraction, in the form of instructions that let us perform operations in system mode. As I said, this exists not only for isolation but also for security reasons.

All of these kinds of operations have to be performed by the CPU itself nowadays, but how does that actually work? Well, I'm going into some of the details here, but not all of them; it's far too much, and unfortunately I don't have the week I would need to cover everything. I will not go into many of the details related to memory handling and so on; we'll take only a very short excursion into that. I'm mostly looking at some of the nerdier aspects of instruction encoding, which is the next thing we'll talk about, at how individual instructions actually get scheduled in the CPU, and at what kind of advancements we've made in these areas over the last decades.

Instruction encoding is something quite abstract to most people. For those living in the compiler world and those kinds of tools, it's a natural way to think about things, but just imagine: we have a general-purpose CPU, and we have to tell it somehow what kind of operations it's supposed to do next. Well, we can do this in various different ways.
In the simplest form, usually called the three-address form, we encode an operation as a number, then one or perhaps two source operands, and then a destination target, all specified in some form that can be stored in memory. That's what the assembler creates; that's what is usually called machine language: a sequence of these encoded instructions in memory which, when executed sequentially, perform the operations the program is designed to do. In the simplest form, an operation like the one in this example operates on two registers internal to the CPU: take the operation specified in this sub-field, take the two operands from these two registers, and once you're done with the operation, store the result in that register. In most cases we also have to allow a form where some or all of the operands come from memory. At the very least this has to be done for operations like load and store, which I mentioned at the beginning: for load and store we have to encode a memory address from which we want to load the data, or to which we want to store it, plus a register, the register into which the data is loaded or from which it is stored. Those are some of the simple instructions we can encode; other operations might vary slightly, maybe they need a couple of different fields, but you get the gist of what we have to do.

How this looks in practice is actually quite different for different processors. In what's called the RISC world, reduced instruction set computers, we usually have the instructions encoded in 32-bit words. There are variants, compressed formats, truncated formats and so on, where this is made smaller, mostly for embedded systems, but in general-purpose systems we have, most of the time, 32 bits available to encode our instructions with all the parameters. What you see at the top is the encoding of a single instruction, and at the bottom an overview of the different instruction formats. This comes from RISC-V, a completely freely and openly developed CPU architecture, initially out of Berkeley. And you see that it's really uniform, in the sense that you have fixed fields inside the 32-bit word that makes up the instruction. That makes it very easy for hard-wired logic, in the form of an ASIC or something like it, to get at the individual pieces of information: we always know that within a fixed range of bits inside the word we find the destination register, some form of register, and we can pull that out with no decoding needed. The actual operation to be performed can also be looked up very easily: you see the arrows originating from some of the first bits of the 32-bit word; those are the bits used to index this table here, and the table describes which operation, at a high level, is to be performed when those five opcode bits have a certain value.
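As a concrete illustration of that uniformity, here is a small Go sketch that pulls the fields out of one RISC-V R-type instruction word using nothing but fixed shifts and masks. The example word is my own, not from the slides; 0x002081B3 should decode as add x3, x1, x2.

```go
package main

import "fmt"

func main() {
	// One RISC-V R-type instruction: every field sits at a fixed bit
	// position, so all fields can be extracted in parallel, no sequential
	// decoding needed.
	var word uint32 = 0x002081B3 // add x3, x1, x2

	opcode := word & 0x7F         // bits 0-6: selects the operation class
	rd := (word >> 7) & 0x1F      // bits 7-11: destination register
	funct3 := (word >> 12) & 0x7  // bits 12-14: sub-operation
	rs1 := (word >> 15) & 0x1F    // bits 15-19: first source register
	rs2 := (word >> 20) & 0x1F    // bits 20-24: second source register
	funct7 := (word >> 25) & 0x7F // bits 25-31: further sub-operation

	fmt.Printf("opcode=%07b funct3=%03b funct7=%07b rd=x%d rs1=x%d rs2=x%d\n",
		opcode, funct3, funct7, rd, rs1, rs2)
	// Expected: opcode=0110011 funct3=000 funct7=0000000 rd=x3 rs1=x1 rs2=x2
}
```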
For instance, all kinds of load instructions have those five bits set to zero, so we can write very simple logic in the CPU that initiates the execution of a load instruction whenever it sees those zero bits there. It's very simple, it's very fast, there's not much logic necessary, and not much electricity necessary to do the decoding and actually start the execution.

On the other hand, here is the exact counterexample: this is how the x86 instruction format looks nowadays, and if Intel gets its way it gets more complicated every single day, since they keep inventing more and more of these instruction encoding extensions. And you can imagine: this is not a single 32-bit word. All of these blocks are individual bytes, and not all of them can appear in the same order, so you actually have to decode the first byte to know which path to take to decode the second byte. That's a sequential operation, and we hate sequential operations; it had better be parallel. In a RISC architecture we can decode all of the different fields at the same time; here we can't. Which means that to accelerate these operations, and to perform them at all, an enormous amount of logic is necessary in what's called the instruction decoder of a CISC system like x86. And, as we'll talk about later, we actually want to be able to decode more than one instruction at the same time. Imagine the nightmare: we don't even know where the second instruction starts, let alone what the individual fields of the first one are. These operations require lots and lots of logic; thousands, I don't actually know the number, probably millions of gates, just for the decoding, and keeping those gates running takes electricity. You can argue that the instruction decoder of one of the high-end Xeon chips probably takes as much energy as an entire low-end Arm chip. It's mind-boggling what we're stuck with. That's something to think about when you look at CPUs: if you're going for really energy-efficient computing, x86 really should not come to mind. And Intel has the problem that they base even their future architectures still on the x86 architecture, while you have Arm out there, for instance; they simply cannot let go. It's a really, really strange situation for them.

So how does the CPU actually execute what it's supposed to do? With the instruction format I introduced the concept of decoding, which also appears here, but the actual sequence of executing an instruction can be summarized in these steps. This derives from the fact that we started defining what a CPU does even before we had integrated CPUs, when we still had discrete transistor logic making them up. So we still have this as a sequence: we fetch the instruction from memory; we decode it, that's the thing from the previous slide, where we try to figure out what this instruction actually does; then we find out, ah, here are the parameters, and we fetch them. Fetching them can mean different things: we can read them from memory, or we can get them from the instruction itself.
There are so-called immediate instructions, which encode part of the operands for the operation in the instruction word itself, but hopefully, in most cases, the data actually comes from the registers. All of these things depend on the decoding having happened first; before that, we cannot really start. And once we have all the parameters in place, wherever they are needed, we can finally start the execution.

I'm not sure how many EEs are here, or how many of you have designed your own CPU, but you know that if you're building something like an ALU, an arithmetic logic unit, it's not as if you apply the electrical inputs and, a nanosecond later, you expect all the output signals to be available. There are propagation delays, there's a lot of logic in between; in short, it takes time to actually finish these kinds of operations. So there are limitations when it comes to execution, and only if we reduce the frequency of the chip to a really, really low number can we expect the propagation not to have any big effect. Some of us started on machines where we had less than a megahertz of CPU frequency; on those machines we didn't care about this, we had a single cycle go by before the result of the instruction was available, and that was nice. But over time we sped the process up by a factor of 5000 or even more, and all of a sudden the speed of light and the propagation of signals actually make a difference. So that is not the case anymore.

Then, once we have the result of the computation, whatever form it takes, we have to write it back. This can be into memory or into a register, and we also have to update what's called the state of the CPU, usually in the form of status flags for the arithmetic unit or other things. Once we're done with that, we have finished executing the instruction. And that is, up to this point in the logic, how we execute instructions. But if we constrained ourselves to executing them really like this, strictly sequentially, so that before we've reached the end of step five we cannot start step one for the next instruction, we could not scale up to the CPU performance levels we're seeing right now. What I'm going to describe going forward is how this actually works and how many of these things have been improved over time.

But first, let's take a step back in history. Does anyone know what kind of CPU this is? It could in theory be the Z80, but it's actually the 8080: these are the data and address registers that were in the 8080, and the Z80 was a successor of it. I like the 8080 as an example because it's so simple; we actually understand it at the transistor level nowadays, and everything is freely available. I want to go through these steps using the 8080 as the example because, in theory, really nothing has changed. We have the same components; some of them are a lot more complicated and work differently, and we've added some additional components, but all of these still exist. So, for step one, fetching: we have an internal register, the temp register, which takes the current instruction to be executed, and we load the instruction from memory. The memory is addressed via the address bus, which you see at the bottom right: we put the address on the bus, and the next cycle we can read the value from memory through the data bus at the top, and it gets stored into the temp register.
Then the decoding happens. In the 8080 that's the PLA; nowadays it's a lot more complicated, of course, as we alluded to when talking about instruction encoding. These things used to be very simple: what the PLA does is set various internal lines to steer the data flow and select what gets executed. It's basically a lookup: if this instruction comes in, then set these lines, at this clock cycle. Next comes fetching the parameters. The 8080 was a simple machine; it didn't really have the three-address form, so there can be only one additional parameter addressed in any single instruction, which means we only need to worry about loading one of them, and that's loaded into the temp register you see there. The accumulator, ACC, is a special register that is always implicitly used in the arithmetic operations. To load something, you again put an address on the address bus and get the value from the data bus in the next cycle, or you get it from the register block on the right-hand side. Then you do the operation: the ALU is triggered by the PLA, that is, by the instruction decoding; the PLA sets the appropriate bits to instruct the ALU which operation to perform. And then we write back the results, either into any of the registers, including the ACC, or out onto the data bus for store operations, after previously having selected the target using the address bus, and we also update the flags. These are the kinds of operations going on all the time.

But at that point we run into one fundamental problem, and that is that we cannot speed this up indefinitely, because memory is slow. Think about it in terms of how we used to do it; this is that era, and φ there is the clock. In a one-megahertz world with static RAM attached to the CPU, we were able to rely on the fact that after we put the address on the address bus in one cycle, the next cycle we could read the memory content. But now let's speed the whole thing up to a two-gigahertz clock. Well, now it takes a hundred cycles to read from memory. Just imagine what that means for fetching the instruction from memory before the decoding phase: we would have to wait a hundred cycles, so the effective frequency would not be two gigahertz, it would be more like twenty megahertz. Way less. This is not going to work. We need to amortize the memory accesses: we don't load just a single byte, we load more than that and keep it available inside the CPU. And we need to do something while the memory accesses are actually happening, to keep the CPU busy. These are the guiding principles of the last 25 years of CPU design: increase what's called the IPC, the instructions per cycle, to more than one, so that we can actually perform more work.
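To put rough numbers on that memory argument, here is a back-of-the-envelope calculation in Go. The latency figure is illustrative only; the point is how the memory round trip, not the core clock, would set the pace if every instruction fetch had to wait on DRAM.

```go
package main

import "fmt"

func main() {
	clockHz := 2_000_000_000.0 // 2 GHz core clock
	missCycles := 100.0        // ~100 cycles for a memory read (illustrative)

	// If every instruction fetch waited on memory, this would be the
	// effective instruction rate.
	effective := clockHz / missCycles
	fmt.Printf("effective instruction rate: %.0f MHz\n", effective/1e6) // ~20 MHz
}
```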
What has crystallized over the last couple of decades is that after that, we move things into what's called the decoded instruction cache; it has various different names, I'll just call it this. It simply stores the decoded instruction, in whatever internal form the developer of the CPU finds useful, for each of the incoming instructions. This is not that problematic for a RISC CPU; in theory we could store it in the RISC format in there as well, unless we actually want something else. But for a CISC CPU like x86, this is crucial, because the decoding, as we said, is so complicated. And what is more to the point: nowadays x86 instructions don't actually get executed as-is. The x86 front end translates each x86 instruction into a number of micro-ops, which are what gets executed, so the decoded instruction cache caches all these micro-ops that are the result of the decoding. After that we get to what's called the reorder buffer. This is something I'm going to talk about in detail now: it is the data structure which pulls instructions out of the decoded instruction cache, one after the other, once they can be executed. And that's the important part: the first part, the decoder and the decoded cache, is in order, handling all the instructions in order. The rest doesn't necessarily do this, and what that means is what we are going to talk about now. So let's talk about what this actually means in terms of some actual code. I have to apologize that it looks a little bit weird going forward, because the font for some reason changed after I wrote the slides; here it's okay, but later on you'll see the markers offset by some random amount. I gave each of the units which you see there an individual number, and we are now looking at an instruction sequence, which you can see on the left-hand side, and how it proceeds through the CPU. The first instruction, FLD, which is a floating-point load, gets fetched from memory and decoded. With single issue, the traditional way of doing it, the instruction gets decoded and then it can be operated on. So in this case the instruction is in the decoded cache, and because there is nothing else going on, it can immediately be executed. In the meantime (remember, we want to do multiple things at the same time) the decoder is not doing anything anymore, so we can already get the second instruction decoded. Why is there no cache access for it in my example here? Remember, one of the things I said before is that we want to amortize accesses; we don't want to access memory for every instruction. Along with the memory fetch for the first instruction, we got the memory for the second, third and following instructions as well. These things are necessary for performance, because we cannot wait on memory every single time; this is what instruction caches are for. So we now have some form of parallelism: the first instruction is being executed while the second instruction is being decoded, and this continues. Once the first one actually starts executing (and because it's a load, it also uses the caches and so on), the second instruction can be put into the reorder buffer, and the third instruction can start to be executed.
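To pin down what "in order in the front end, out of order behind it" means, here is a toy model of a reorder buffer in C. This is my own illustration with invented field names; real designs track far more state. Entries are allocated in program order at the tail, may complete in any order, but retire (become architecturally visible) strictly in order from the head:

    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_SIZE 8

    struct rob_entry {
        bool valid;   /* allocated by the in-order front end            */
        bool done;    /* finished executing, possibly out of order      */
    };

    struct rob {
        struct rob_entry e[ROB_SIZE];
        unsigned head, tail;   /* retire from head, allocate at tail    */
    };

    static void alloc_entry(struct rob *r) {
        r->e[r->tail % ROB_SIZE] = (struct rob_entry){ true, false };
        r->tail++;
    }

    /* Retirement walks from the head and stops at the first entry that
       has not finished; everything behind it must wait, no matter how
       long ago it completed. This is what keeps the machine looking
       sequential from the outside. */
    static unsigned retire(struct rob *r) {
        unsigned n = 0;
        while (r->head != r->tail && r->e[r->head % ROB_SIZE].done) {
            r->e[r->head % ROB_SIZE].valid = false;
            r->head++;
            n++;
        }
        return n;
    }

    int main(void) {
        struct rob r = {0};
        alloc_entry(&r); alloc_entry(&r); alloc_entry(&r);
        r.e[1].done = true;                   /* #2 finishes first...    */
        printf("retired: %u\n", retire(&r));  /* 0: #1 still blocks      */
        r.e[0].done = true; r.e[2].done = true;
        printf("retired: %u\n", retire(&r));  /* 3: now they all drain   */
        return 0;
    }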
So that's already much more efficient than what we saw with the 8080 execution model, where we had a single instruction in flight at any point in time. But it's still not really that great, specifically because the instructions you see here, for instance the first and the second, do not depend on each other in any way, shape or form; they could even be executed in the reverse order without affecting the correctness of the program. So what has been done in high-end CPU design is to analyze on the fly, while the program is executing, what the dependencies between the different instructions are, and that's what is represented here. I gave every single instruction a number in sequence, and on the right-hand side you see a dependency graph of the instructions. Only if there is a line, an arrow pointing between two nodes, do we actually have a dependency. Which means, if you look at this, there's no arrow between one and two, or two and three, or three and four and five: the first five instructions could actually be executed at exactly the same time, if we had the necessary bandwidth. And it turns out high-end processors nowadays have exactly that. We are talking about multi-issue CPUs, with decoders capable of decoding more than one instruction per cycle, and the decoded instructions are stored in these caches. Therefore, if at some point we have enough instructions in the decoded cache, we have the resources to execute them in parallel, and there are no dependencies, we can execute more than one instruction at the same time. This is how we get IPC numbers larger than one, and that's a very important thing for performance. So this is the explanation in graphical form: instead of only the first instruction being decoded and so on, all of the instructions are decoded. Here again, for RISC this is trivial, because we know every single instruction is 32 bits wide; for CISC, especially x86, it is horrendously difficult, and you can see what kind of hoops they have to jump through to make this work. Once we do the decoding, we can store them all in the decoded instruction cache, and depending on what we have in terms of execution units and so on, we might be able to start executing them. But that's not always the case; we don't always have enough execution units. In this case there's a single one, so even though the second, third and fourth instructions are already decoded and already in the reorder buffer, without enough execution units we are still bottlenecked there: there is only one operation we can perform at any point in time. Which means we want parallel execution in addition to parallel decoding and all the rest. Nowadays, if you look at the block diagram of a CPU, you see multiple execution units, often built for a special purpose: address calculation, floating-point operations, et cetera. They're usually specialized in some way or form, but you have multiple of these pipelines going at the same time. Which means that when we reach the execution stage, we can start perhaps not only the first but multiple instructions at the same time, if we have the units available and there are no dependencies between the instructions.
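You can watch this dependency-graph effect from plain C. In the sketch below (my own example), both loops do the same number of floating-point adds, but the first is one long dependency chain, where each add needs the previous sum, while the second splits the work into four independent chains that the CPU can keep in flight simultaneously. Compile without -ffast-math, so the compiler is not allowed to reassociate the first loop itself:

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    #define REPS 200000

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        static double x[N];
        for (int i = 0; i < N; i++) x[i] = 1.0 / (i + 1);

        /* One chain: every add depends on the previous one, so the adds
           must flow through the pipeline strictly one after the other. */
        double t0 = seconds();
        double s = 0.0;
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                s += x[i];
        double t1 = seconds();

        /* Four independent chains: there is no "arrow" between them in
           the dependency graph, so several adds can be in flight at once. */
        double t2 = seconds();
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i += 4) {
                s0 += x[i];     s1 += x[i + 1];
                s2 += x[i + 2]; s3 += x[i + 3];
            }
        double t3 = seconds();

        printf("single chain: %.3fs (sum %.6f)\n", t1 - t0, s);
        printf("four chains : %.3fs (sum %.6f)\n", t3 - t2, s0 + s1 + s2 + s3);
        return 0;
    }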
This is why the dependency tree I was getting at is so important, and you can imagine that it's also something compiler writers have to take into account: they generate code so that the CPU has a much higher chance of being able to work on multiple instructions at the same time, because there are no dependencies between them. So in this case we're happy with this. But now, what happens in this case? The J stands for jump; this is RISC-V assembly, by the way, for those who don't know. What happens with a jump instruction? We get to the point that we have decoded the jump instruction, but we cannot start doing anything else until we actually execute it, because executing the jump instruction means that the instruction pointer, which is a register in the CPU, is updated to point to the next instruction; only at that point can we fetch that instruction from memory, decode it, et cetera. And because we know that fetching from memory, decoding, and getting the instruction into the reorder buffer takes time, we have dead air here: bubbles in the pipeline, as people like to call them. Before we get to the point where we can fetch from the memory location (here, and here; the label is unfortunately offset, it's supposed to be here, of course, that's just the font problem), until we are actually at this level here, we have nothing else to do in the meantime. That's really, really bad. This is where branch prediction comes in, which has been in the news a lot for the last couple of months, and here we see how necessary it is. If we have branch prediction available, then after this instruction here is decoded we can already make a guess as to where we will be executing next, and the CPU will in most cases already start fetching memory from somewhere. It is not necessarily the right place, but from somewhere it will start fetching memory and decoding instructions. This is going on all the time. But at some point the guess might be wrong, and how do you handle this so that you actually get a good prediction? Because in the case of a bad prediction you have to roll back all the computations you've done and start from scratch; it's a pipeline stall, and you lose a lot of performance. So what is done is to have branch prediction units which implement a state machine like this one; that's a simple one. Nowadays, in x86, you will find the most advanced ones, which look something like this: the branch address, the actual physical address of the branch, is taken into account together with a couple of other inputs, through a hash function, which then looks into a global history table, which uses the state machine from the previous slide, and this gives you the predicted target address. This actually works remarkably well.
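The cost of misprediction is easy to provoke yourself. In this sketch (my own example), the same counting loop runs once over random bytes, where the comparison is a coin flip the predictor cannot learn, and once over the same bytes sorted, where the branch becomes almost perfectly predictable. One caveat: at higher optimization levels the compiler may turn the if into a branch-free conditional move, which hides the effect, so try -O1:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static long count_big(const unsigned char *v) {
        long s = 0;
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)        /* this branch is what the predictor sees */
                s += v[i];
        return s;
    }

    int main(void) {
        unsigned char *v = malloc(N);
        srand(1);
        for (int i = 0; i < N; i++) v[i] = (unsigned char)(rand() & 0xff);

        double t0 = seconds();
        long a = count_big(v);      /* random order: ~50% mispredictions */
        double t1 = seconds();

        qsort(v, N, 1, cmp);        /* same data, sorted */
        double t2 = seconds();
        long b = count_big(v);      /* predictable branch: much faster   */
        double t3 = seconds();

        printf("unsorted: %.3fs  sorted: %.3fs  (sums %ld %ld)\n",
               t1 - t0, t3 - t2, a, b);
        free(v);
        return 0;
    }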
So there's another limit to parallel execution, and it's this (and I think the markers might be off again): these four instructions can be executed in parallel, but what about the next one? This instruction here is loading into the a2 register, and this instruction here is also using the a2 register. So you could say, well, that's a dependency, but in reality it's not: it's what's called a false dependency, and we have to recognize these. If we had a single location in the CPU where the content of a register is stored, we could not handle that efficiently. So what is actually done nowadays, instead of having a specific location for each register, is what's called a register file, a concept I've found most people have no clue about. Registers are nowadays not actual locations for data: register names are pointers into a data structure, a block of very, very fast RAM, so to say, where the actual content lives. When we come to a false dependency, we simply allocate a new slot in the register file for the newly loaded value and point to this new location: oh yeah, by the way, from now on the a2 register is actually over here. The old location stays in place; it's not affected, so the older instruction can still execute at exactly the same time. It's a clever concept all by itself, and it means we can execute all of these things at the same time, because they don't really have dependency problems. The last thing I want to introduce is the concept of pipelines, and for that I have a couple of nice diagrams. This is conceptually how an adder works, and I already talked about signal propagation being limited in its speed: there are lots and lots of gates involved here, and therefore we cannot perform arbitrarily complex operations in a single cycle. As an example (and this is not how it's done today anymore), imagine you want a 16-bit adder. You can implement it by widening this one here to 16 bits, but it gets more and more complicated, because you need more and more levels of logic to handle the propagation of the carry bit. Or you can construct a 16-bit adder from two 8-bit adders. The problem is that you need to store, for the time the first 8-bit adder is working, the inputs to the second one, and you need to hold the results of the first adder until the results of the second become available. That's done using latches, synchronized on the clock of the CPU. It means you have now built a 16-bit adder using limited logic, but the result is not available after one cycle; it takes two. On the other hand, the lower half is not busy while the upper half is working. So what happens in a CPU pipeline is that instead of waiting for the result of an arithmetic operation to be available at the end, we start the next operation before that. For complex operations like multiplication, which I think has a latency of around 7 cycles nowadays, we can have multiple of these operations going on at the same time. This is an efficient way of doing things, but it requires help from the compiler, and lots and lots of logic in the CPU.
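Here is a toy model of that two-stage adder in C, my own illustration rather than anything from the slides: two 8-bit stages with a latch between them, so a sum appears two "cycles" after its operands go in, but a new pair of operands can be accepted every cycle. That is exactly the latency-versus-throughput trade of a pipeline:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct latch { uint8_t a_hi, b_hi, sum_lo; bool carry, valid; };

    struct adder16 {
        struct latch mid;            /* sits between stage 1 and stage 2 */
        uint16_t out; bool out_valid;
    };

    static void clock_tick(struct adder16 *p, uint16_t a, uint16_t b, bool in_valid) {
        /* Stage 2: add the high halves plus the carry saved in the latch
           on the previous cycle. */
        if (p->mid.valid) {
            unsigned hi = p->mid.a_hi + p->mid.b_hi + p->mid.carry;
            p->out = (uint16_t)((hi << 8) | p->mid.sum_lo);
            p->out_valid = true;
        } else {
            p->out_valid = false;
        }
        /* Stage 1: add the low halves and latch everything stage 2 needs. */
        unsigned lo = (a & 0xff) + (b & 0xff);
        p->mid = (struct latch){ a >> 8, b >> 8, lo & 0xff, lo > 0xff, in_valid };
    }

    int main(void) {
        struct adder16 p = {0};
        uint16_t as[] = { 300, 10000, 515 }, bs[] = { 4, 55536, 2 };
        for (int cyc = 0; cyc < 5; cyc++) {
            bool v = cyc < 3;        /* feed a new pair for three cycles */
            clock_tick(&p, v ? as[cyc] : 0, v ? bs[cyc] : 0, v);
            if (p.out_valid) printf("cycle %d: result %u\n", cyc, p.out);
        }
        return 0;
    }

Feeding three operand pairs back to back produces results on cycles 1, 2 and 3: two cycles of latency, but one result per cycle of throughput once the pipeline is full.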
I have to make it quick now; this is the last slide I have. Everything in this talk was very, very simplified. If you want to get an image of how complicated this really is, here is a block diagram of the Intel Skylake: you see all the things we talked about here, and a lot more. But you can imagine how all these things have interdependencies with each other, how all of this has to be put in place, and how we have to write code, and have the compiler generate code, that actually allows this kind of interplay to happen efficiently. We can very easily write code that makes this thing stall and behave like a 10-megahertz processor. The art is to write code that utilizes all of this logic, all of it, in parallel at the same time; only then are the CPUs really, really good. That's the magic of how we write high-performance-computing code: much of it only works when we express things like parallelism not implicitly, to be discovered by the CPU and the compiler, but a little more explicitly. It requires the programmer's help to utilize it fully, and this is why we have to understand how CPUs work. As I said, this talk wasn't even trying to give you full knowledge of how this actually works; hopefully it just sends you out of here interested in the topic. There is lots and lots of literature available about these things, and in my opinion you cannot ever write good code without understanding how the CPU works and how the compiler works. That's it. Any questions?

Hello, can everybody hear me? Great. Everyone, thank you for coming here today, I'm very excited to have you here. I'm here to introduce Will; he'll be presenting on weld.so and beyond. It's going to be a great talk, and I'm really looking forward to seeing this come to life.

All right, hey everybody. Am I audible? If you can't hear me, then... cool. All right, I've got somebody in the middle. I guess I want to thank you in advance: due to some circumstances beyond my control these slides are not finished, and my luggage arrived right before I got here, so I don't have slides, but I do have clean underwear on; in the end I think you'll agree that's probably the better choice. So I'm here to talk about weld.so and beyond. weld.so is a thing that we kind of made up, so to give you a little background: I am a senior software engineer at Red Hat, I've worked on installation and upgrades for 15 years, and our team has a deep history in doing things that involve image construction and packaging, and the weirder use cases of that. The stuff I'm going to talk about is fairly specific to RPM and the way that we at Red Hat build images, but I think the larger lessons apply to the entirety of the Linux ecosystem. Oh yeah, and some quick disclaimers: nothing I say here is official, nothing Red Hat is definitely going to do. Quick show of hands, who in this room works for Red Hat? Okay. For the rest of you: the people who had their hands up, we don't know what number comes after seven; we can't count any higher than that. So this has nothing to do with whatever numbers might or might not come after seven; I'm fairly sure there aren't any. And yeah, I wrote these like two hours ago, so thank you in advance for bearing with me as I ramble at you. So the whole thing I want to say, the promise of this weld.so thing, is this. I'm pretty sure, looking at the way that we construct images (and by images I mean file system images, like containers or virtual machines, or even just doing an initial install on a system), the way that you do that is, you know, we're using RPM, we're using packages to do that.
RPM and friends were designed in the late 90s, and they made sense at the time, but there's a lot of slack and stuff getting in its own way that makes everything we do harder than it needs to be. With some tweaking we could have a model for Linux distributions, for going from upstream projects releasing sources to code that you're running somewhere, that was insanely fast and reliable and just did everything really easily, to the point where I'm pretty sure we could start up a container for every process as it starts, using only the things it actually needs in its file system. Every process would have its own view of the file system, the same way we do with virtual memory, where at startup the dynamic linker goes in, links in the libraries the process needs, and then jumps into your code. We could do the same thing with the file system, and if you think about it, it's kind of weird that we don't. Why is it that every process gets its own private view of memory, which the kernel then arbitrates, but we give them all the same view of the disk? It's because this was designed in an era when disks were really slow, there weren't a whole lot of computers, and the idea of doing that was just crazy; they were like, we'll get to it later. If we can build images that quickly, milliseconds or less, we can actually do things like OS-wide CI and CD, which has been troublesome for Red Hat, or for Fedora for that matter: we don't do nightly builds of Rawhide, or we do sometimes and sometimes we don't, and we don't do nightly installer image builds, because we can't. That's kind of my point. I'd love it if we succeeded; that would be cool. So I'm pretty sure we can just tweak a few things. And I say "tweak a few things", but it's actually a lot of small, system-wide changes we would need to make this sort of world possible. It is a lot like the shift from statically linked binaries in the late 80s and early 90s to dynamically linked binaries. We don't think too hard about dynamically linked binaries as a crazy new thing anymore, but there was a time when that was new and controversial and people hated it; there were Solaris admins who were like, I will never have dynamically linked binaries on my system, ever, fist on the desk. So if you're getting a feeling in your chest like "this is crazy, it'll never work, why am I listening to this guy", give me a minute; just clear your head. Okay, so what is weld? weld is just an acronym that we kind of made up for an experimental Linux distribution we're sort of working on. The meaning of the acronym changes depending on my mood; I think it was originally Will's Experimental Linux Distribution, and at one point it was the Wiggum Enterprise Linux Distribution, because of Ralph Wiggum. But yeah, we're basically figuring out new ways to do Linux distro stuff, because the way we do it now, as I keep saying, is gnarly: everything is a lot harder than it needs to be, and we seem to spend a lot of time fighting with ourselves, building new tools to deal with problems inherent in the system rather than fixing the system itself.
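A quick aside to make the "private view of the file system" idea concrete: Linux already has the basic primitive for it, the mount namespace. The sketch below is my own illustration of that primitive, not anything weld.so ships; it needs root (or CAP_SYS_ADMIN), and the tmpfs mount stands in for "stuff this process alone should see":

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void) {
        /* Give this process a private copy of the mount table. */
        if (unshare(CLONE_NEWNS) != 0) { perror("unshare"); return 1; }

        /* Stop mount events from propagating back to the parent
           namespace (systemd makes / shared by default). */
        if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
            perror("mount private"); return 1;
        }

        /* From here on, anything we mount is visible only to this
           process and its children: its own view of the disk,
           analogous to a private virtual address space. */
        if (mount("tmpfs", "/tmp", "tmpfs", 0, "size=16m") != 0) {
            perror("mount tmpfs"); return 1;
        }

        execl("/bin/sh", "sh", (char *)NULL);  /* a shell inside the view */
        perror("execl");
        return 1;
    }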
Everything is like that. Think about how many features of the RPM ecosystem, if you're familiar with it, like comps groups, exist because every time we need to add something, we add another extra layer of metadata and another extra layer of code to parse that metadata. At this point, I think modules involve fetching some YAML, which tells you where to get the XML, which tells you where the SQLite database is, and then you can start parsing; there are like 18 different types of metadata involved, and it's a mess. We could do a lot better. I already kind of went through this, but I think the heart of the whole problem is that we're stuck on RPM. Anybody who has talked to me at all in the past 10 years is sick of hearing me yell about RPM, and I'm sorry, but the problem is that we have so much of what we do encoded in it: everything we build is built upon it, we keep building layers and layers on top of it, and nobody understands how it works anymore. Nobody actually knows the entire details of how the dependency system works, or what's in the dependency tree, or what's in all of the scriptlets. They're non-deterministic, we don't know what they're doing, but we have to run them as root because they might need root. So we have this enormous mountain of code that we can't introspect, we have no idea what it's doing, it has to run as root, and that seems fine to everybody, which kind of makes me cry at night, but, whatever. (Sorry, can you speak up? The "it's fine" part?) Oh, the "it's fine" part. Well, no, everyone just sort of looks at it and goes: I guess, I mean, I don't know what we can do about it. And I'm like: we could not do that, maybe? Just throwing that one out there. So people are working around their problems, and one of the things I think is happening with our big shift to containerization, and with Go, where Go by default builds statically linked binaries, is that it's because of the troubles we have with dependency resolution and things like that. You have your Fedora or your RHEL system and you're like, okay, but I have this thing that needs these libraries and this other thing that needs those libraries; how do we get two sets of libraries that... oh my god, the packages have the same name but they're different versions, they can't possibly coexist on the same system. I don't know why; they're just files; but okay, RPM says they can't. And instead of addressing that problem, we as a community slash industry have been like: what if we just went back to statically linking everything? That was easier. And it works, but it's not a better idea; it just works around that one problem. So the whole thing, and what I kind of went crazy doing for the past while (I'm not going to subject you to it), is that we don't really have a good model for how RPM works. If you look at the Linux kernel, people have done formal memory models for how the kernel deals with memory, what it does when it takes locks, all of that, to maintain the illusion that everything is safe and reliable. And it is safe and reliable, asterisk, as long as you trust your hardware, ha ha. We don't have anything like that for RPM, or for packaging in general. We don't have any set of assertions like "a well-behaved build system needs to maintain these sorts of invariants", or
"we can assume this is true, but we can't assume that". We just sort of make it all up as we go along, and that's where things get weird: like, oh, we can't install two packages with the same name with different versions, because reasons. And it's like, okay, well, if we had a model, maybe we could deal with that. So I have a big, crazy, written-out mathematical model for how you do a Linux distribution, what it actually involves; the pieces aren't that difficult. It turns out, if you take it all apart: when you build a system using RPM, what you're doing is essentially what the installer does. The installer, Anaconda, is a mini distribution: you boot a DVD or whatever and it starts this mini distribution. It used to be a full-fledged mini distribution that had its own mount binary and an init system and everything, which we had to maintain ourselves, and it drove us completely bonkers, so eventually we made it just a very, very stripped-down small Fedora distribution that we boot and load into memory. Then we format your hard drive and start installing stuff into it, which is kind of bonkers when you consider that most of what people do after that finishes is remove all the stuff they didn't want, or make a bunch of changes afterward, because we don't know what happens in the middle; it's all mysteries to us. And it's really weird that we take an empty box and... we have these little packages, and people talk about packages like they're bricks, like you just stack them up and make a wall, but they're more like tiny little robots with chainsaws and arms on them. You dump enough of them into a little room, they fight it out for a while, and then they build Voltron, and you're like: cool, that's neat. And we made it all work, and that's amazing, good job us, but maybe we could make bricks instead, is kind of my whole point. I'm going to come back to this a lot, and I'm going to use a lot of dumb analogies like that, but my point is that as a community and an industry we need to start looking at the places where behavior is not introspectable or not deterministic and start stamping them the hell out, because the problems propagate upward: if you have parts of the lower levels that need to run unknown code as root, your upper level either has to work around that or just deal with it maybe happening sometime, and you can only make certain guarantees about what happens in the middle. It's not great. So we did an experiment on our team: what if we did what the installer does to build systems, but instead of opening the box, throwing in all the robots, and letting them build Voltron, we take all of the pieces, scrape all of the code off the outside, and just make them little bricks, and put the bricks in place? So: take the contents of every package, just lay down all of the contents, and don't run any of the scriptlets at all. And it works; to a first-degree approximation, it works. It turns out there are some things that actually do need to be dynamically generated, but they're really well known. I have a different talk about what RPM scriptlets actually do; there are only like six things. It's: create users; sometimes you generate a
file, like when you have a dynamically generated list of users, okay, fine; sometimes you do things like create a host key, and you don't always want to do that, right? You only want to do that if you're installing on bare metal. If you're making an image that's going to be a gold master you replicate everywhere, it does not need that machine-specific key. So we need to look at all of that, but that's not what this talk is about; the point is, you can kind of throw it all away and it all still works. When we did that, we built a thing (it's on our website, and it's janky), and we can put together a system about, I think it was, and Dave, you can confirm this, maybe 100 times faster than our current stuff. Yeah, it was. We had an internal team trying to do continuous-integration stuff on the kernel, and their whole deal was: build a minimal VM or a minimal system, spin it up, do some stuff with it. It took them about six minutes to build that image, and then they could run their tests. We could build the equivalent image in six seconds, and that's all I/O. So, going back to my point: this is basically dynamic linking. What we're doing at that point is, in the same way you do with memory, taking an empty process and dumping in the pieces that you need. This is the same thing as dynamic linking, and we can borrow a lot of those ideas and make image construction way easier and way faster. This is one of the things that ELF got right and RPM gets wrong. RPM is hard to extend. When was the last time? We've added a couple of tags now and then; we got weak dependencies after 10 years of fighting about it; we have BuildRequires; I don't think we have TestRequires yet, and we've been fighting about that one for my entire career at Red Hat. So RPM is notoriously hard to extend. It also changes without warning. Fun fact: there is a specification for RPM, technically, in the Linux Standard Base, in that it's a de facto specification; they wrote down how RPM worked at the time. If you implement RPM from that specification, it won't do anything: we changed how we store file names in RPM headers and didn't even increment the version number of the file format. We just change stuff out from under people all the time without telling them. It's not great. So this is the other thing I'm going to hammer on: we as a community need to actually start documenting how stuff works, and commit to not breaking it (is that ten minutes left? okay, cool), committing to not breaking things without at least a warning, like incrementing a version number. It's not that hard; RPM I think has a 32-bit version number, so they could do that a couple of times and we'd be fine. But the point that this is dynamic linking is an interesting one, because there's a whole lot of fun stuff that would happen if we started treating building images like we treat dynamic linking. And I think this is where I run out of slides. Yep. So I can either show you my big outline or I can just hand-wave at you, and I might just hand-wave; I apologize for the hand-waving, but here we go. One of the problems we have with containers is that we ship them around as statically linked blobs. They're kind of not statically linked; there are a bunch of layers to them, in the same way that
there are layers to statically linked binaries: when you link in stuff, you've got your compression library and all that. So you build them, and then they're built, and you don't really know what's in them anymore. This is why we have things like container scanning: when there is a CVE of some sort, we have to go back and look at all the stuff that everybody built, figure out which ones have the tainted code, and rebuild all of those. If we were doing it dynamically, the image that you build, your container, would be like an ELF binary: it's your code, plus some headers that say, okay, I need this version of this library, I need this version of these Python libraries; the same sorts of symbols that we're using in RPM as dependencies, with some tweaking, because we want them to be deterministic. Because, as it turns out (and I have a talk about this tomorrow), I'm pretty sure RPM dependencies are now Turing-complete, in that you can use them to run arbitrary calculations, which is not the best thing to want out of a dependency system. It's more a fun party trick than actually worrisome, but it is a thing to talk about. Anyway, the point is: if you have a reasonable dependency system, like ELF's, you can pull in just the pieces that you need at process startup time. As I understand it, most container runtimes don't share memory, or at least with the thin-pool stuff (somebody correct me if I'm wrong), each container has its own block device backing its image, which means the block device is different for each of them, which means that if you have a thousand containers using the same copy of OpenSSL, you have a thousand copies of it in memory and on disk, which is... sorry? It's not always like that? Okay, good, so there is some improvement, but last I heard it was like that, sort of because people were just like, well, you need a block device. So I am stumping for a fairly significant change in how we expect a system to behave. We should expect the disk to behave more like we expect memory to behave. The expectation that you should be able to write to any part of your disk should be silly, because we don't expect you to be able to write to any part of memory; that's obviously silly. And the expectation that you can write to the disk and other processes will be able to read it by default, that's also kind of silly; we don't always want that, and it leads to a lot of fun problems. This is where we get temp directory attacks: the reason we have to worry about temp directories at all is that there's a well-known path, and you assume that every process shares the same file system space; well, then you have problems. We could eliminate that entire class of problems. We can eliminate container scanning (we don't need to do that anymore if we're automatically creating stuff), and we can eliminate directory-name attacks, by dynamically creating your file system just for you, so that other people can't actually look at it. So that's the weld.so concept: let's look at how we put together our dependency chains, winnow away all the parts of RPM that are not deterministic, make dependencies themselves sane, and get to a place where we can just mash the package content together when we need
it. And to do that effectively: back when dynamic linking was invented, you needed mmap to make it work, because actually copying a whole bunch of memory into place takes a while. With mmap, if you don't know how it works, you basically tell the kernel: put this library into this memory space, and it goes, cool, okay. Whether or not it's actually in memory yet is not something you have to care about; the kernel puts it there when it's needed. We could do the same thing with files. Rather than building the entire container right when you start it, we give you an empty namespace, and when you try to look things up, then we start putting stuff into it. In fact, we already have the capability to do this in the kernel: it's just bind mounts. We need to do some stuff with rearranging paths to make this work, but instead of decompressing a whole bunch of RPMs and copying all their contents in, you could just bind mount the contents of each thing that you need into the private space for your process. This should take milliseconds. It requires us to do some janky stuff with the paths where you look for libraries, but we can do that; we have control over the entire system, and we know how to do all of these things. So the point that I'm making here, broadly, is that everything we need in order to build a system where everything is reliable and deterministic and shared in a way that isn't like it is now: we have control over all of it; we could just do this. I just need people to buy in on the idea of making a somewhat radical shift in how we put things together, and, I don't know, that meets with a lot of resistance. Most of the time what I get is what I call the MacGyver problem, which is sort of like this. If you've ever watched MacGyver: MacGyver is a super dude who fixes things with bubblegum and duct tape and saves the day with ingenuity, using things in unpredictable ways. A lot of times, if you go back and watch an episode of MacGyver, the entire thing hinges on, like, somebody needing to get a piece of information to somebody else, or else the ambassador is going to explode, or whatever it is. I don't know why, but you have to tell the ambassador not to say whatever it is. But if you have cell phones, the entire plot just falls apart: you're just like, hey, ambassador, don't do that. Okay, done, episode over, nobody needs MacGyver. So the problem is, if I show up in an episode of MacGyver and everyone's like, what are we going to do, the ambassador is going to explode, and I'm like, just call him on a cell phone, everyone's going to look at me like, what? And I'm like, okay, well, all right: all you need to do is build a worldwide network of radio towers, then invent pocket supercomputers and teach them to talk to the radio towers, and then you can just call him on his pocket supercomputer. And they're like, we're going to go with MacGyver's plan, because that's going to keep the ambassador from exploding now. It doesn't mean that building this network is a bad idea, but it means it doesn't solve the problems people are immediately facing. And I think this is the other thing we as a community and an industry have been doing: MacGyvering the hell out of things for years, and not really looking at the larger problem of what it is we're trying to accomplish.
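Before going on, here is what that "it's just bind mounts" step could look like in practice. This is a sketch of my own under stated assumptions: the /store and /newroot paths are hypothetical, the target directories are assumed to already exist, and a real tool would handle mount ordering, read-only remounts, and cleanup:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>

    static int bind_in(const char *src, const char *dst) {
        /* MS_BIND makes dst show the same files as src: no copying, no
           decompression, and one shared page cache for every consumer.
           (Making it read-only would take a second pass with
           MS_REMOUNT | MS_BIND | MS_RDONLY.) */
        if (mount(src, dst, NULL, MS_BIND, NULL) != 0) {
            perror(src);
            return -1;
        }
        return 0;
    }

    int main(void) {
        /* Same private-namespace setup as the earlier sketch. */
        if (unshare(CLONE_NEWNS) != 0 ||
            mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
            perror("namespace setup");
            return 1;
        }
        /* Hypothetical content store with one directory per package;
           assembling a root becomes a handful of bind mounts, which is
           why it can take milliseconds rather than minutes. */
        if (bind_in("/store/bash-5.0/usr", "/newroot/usr") ||
            bind_in("/store/openssl-1.1/usr/lib", "/newroot/usr/lib"))
            return 1;
        printf("assembled a root under /newroot\n");
        return 0;
    }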
And the larger thing about what we're trying to accomplish when we do image creation, and what we're doing with containers, is that we're trying to extend what we already do with memory to the disk. We have a lot of stuff, like interpreted languages such as Python, that want to do the same thing C does: okay, I might need this library, so make sure it's available for me, and the kernel will put it in your memory space if it's needed. We could do the same thing with the file system; it just makes sense to do it that way. But that's the MacGyver problem: it makes sense, but it would require systemic changes to everything we do. Not huge changes, little changes, but system-wide ones. And I can give you parts of this. Like how you store everything on disk: you want to do it OSTree-style, in a content-addressable store, so you get automatic deduplication, so it's efficient, and all that good stuff. We have all of the pieces we need to build this sort of a system; we just haven't put them all together yet, and really it's about getting everybody to get the idea in their head of what it is we're trying to build, and then start building toward it. So that's sort of what this talk was going to be about; it was going to have a lot more slides. And the weld.so thing is basically: once we get to that point, once we have made these changes, we could have a system where your thing maybe has an ELF header on it and it actually calls out to weld.so, which then dynamically constructs an entire file system for your thing, then dynamically constructs its memory space, and then it runs. We can build all of this; all of the pieces are already there. And that's about all I wanted to say; I think it's about time. So, are there any questions about any of this, or do you just want to hear me rant about RPM more? Because I can do that all day.

Is anyone doing this today? Any of the other distributions, any other OSes?

No, as far as I can tell, and I have a theory on this. I've asked around wherever I could find anybody, and there are pieces of it: you see making package builds more deterministic in NixOS, and atomic updating and quick generation of images in OSTree, but they're still sort of constrained by RPM or their current system. It takes an industry-wide effort, essentially, which is how ELF and dynamic linking got implemented in the first place; but the industry was a lot smaller then, you could throw stuff around a lot more easily, and there aren't a whole lot of companies making operating systems left. So there's a lot of MacGyvering, and not a lot of "hey, we should all work together on this massive system-wide change for the good of the industry". Containers showed up because they scratched an itch, but I haven't seen a larger effort to attack the larger problem I'm trying to describe. And sorry, I should have repeated the question: the question was whether anybody is working on this yet, and yeah, I've asked around; I haven't seen it yet. Have you?

Have you given any thought to what a migration path would look like? Hypothetically you get all this working; how do you get from here to there? Because it sounds kind of disruptive.

Yeah. And that's a big one.
I think the way we do that is not as hard as we might think, in that a well-designed system, one that adheres to a good model of how we want a distribution to work, would, sort of by default, by eliminating some of the gnarlier parts of what we have now, be compatible with it in the abstract. What we did with our experimental image builder was just import RPMs: we strip out the parts that would not be allowed in our system, but we import content directly from RPMs, and I think we also extended it to work with npm modules. Is that right? Okay, yeah. So the idea is that any model sufficient to make this work is also sufficient that you can probably wedge existing things into it. And so the plan is to look, bit by bit, at how the whole system works. Linux distributions are a loop, basically, right? You have sources, then you make builds, and you put a bunch of builds together to get an image; and then the trick is, when you did that build, it happened inside a build environment, which is an image. So you've got this loop going, of source to build to image. You can take pieces of that loop and replace them one at a time with something that takes the same inputs and has the same outputs but maybe is different in the middle, and then you can start cutting out pieces. Spec files, for instance. Oh boy, spec files: there are like four different Turing-complete languages fighting for dominance in there, and it's horrifying. But if you had something that was data that you could compile into a spec file, now you've got something else you can write that's a little more reasonable, and we can still plug it into our existing stuff. And then: wait, what if, instead of doing an RPM build and then putting the RPM content into our weird content store, we just build straight from this thing into the content store? We're hopping over RPM at that point. So part of it is looking at the bigger model, identifying each piece of the system, and figuring out which ones we can most easily replace with something that's compatible but better. Which I know is an abstract, hand-wavy answer, but I hope that makes sense.

Seems feasible. All right, cool.

You may actually have touched on my question already. First I want to say: great, you've said it way better than I have; I've been ranting about this for the last 15 years too. In working with container builds, we're doing this thing most of the time where we're building RPMs and then running yum inside containers, which... I need air-sick bags, right? But one of the things, in looking at ways to address that specifically, was to take the RPM system and break it in half and say: all right, here's a build piece which creates artifacts, and then we take them and put them into an RPM, or into a container image. You may have more sophisticated ways of thinking about that, but it seemed to me a relatively simple way of reusing one bit. But then I got to your point of "RPMs, no one knows how they work", so the right part is hard.

Sure. We do want to maintain backwards compatibility, so it's not so bad, and I think that's the right instinct: to say, all right, we're going to have something that can do things a new way, but also builds the old way if you want it to. We didn't get as far back in the process as looking at the spec side of it; I mean,
I have, you know, obviously big hand-wavy ideas about it, but that wasn't what we originally started attacking. We started attacking RPM as a storage medium, and the dependency resolution and image construction part of it.

I'm concerned with the same problems. Yeah, exactly; and yes, let's talk. The inefficiency in containers is mind-boggling; it's a good solution, but, you know, we just keep getting more memory and faster computers. It sounds like there's a good security angle in your view; that seems like maybe the biggest selling point, if that's the right direction.

Yeah, it depends on who I'm talking to, but there's some interesting stuff about that. If you think about all of the memory protection that we've added to ELF over the years: you could do an equivalent of address space layout randomization, file system layout randomization, where you have marked in your container every place where you call, say, /bin/bash, and just like we do when you start up an ELF process, we go through and relocate all of those. So instead of being /bin/bash it's some randomized path, and when we link /bin/bash into your image, we put it at that randomized path. So even you don't know where bash is, and your attacker can never run a shell, because they don't know where it is and you don't know where it is. All of the protections that we have for memory we can apply equally to the file system, which is really interesting when you talk about interpreted languages that need the file system to get their libraries. It's sort of a wonder we didn't do it before, from my point of view. I think for a certain crowd of folks that's very interesting; I have a friend who works for the DOD who is very interested in this very problem. So yeah, I think that's worth exploring further. And again, this is a general thing: any ideas this gives you about what you could do with a system like this, I'd love to hear, because I know the parts that I care about, having done installs and upgrades for way too many years, but I want to hear about the other stuff. I have vague notions about security things, but I don't know what specific classes of problems it would eliminate for your stuff, so please come talk to me; let's figure this stuff out. Anything else? We have time for one last question.

So, you brought up bash, and this is where I'm trying to wrap my head around this. If all processes only had access to, I guess, the files that they sort of owned, sort of like how memory does it, how would an interactive bash shell work? I teach an intro command-line course, right, and they learn about cd and all that kind of stuff. How would something like that exist in the environment you're thinking of?

In what sense? What part of that would be tricky? I mean, would you get a bash shell that could go wherever in the directory hierarchy it wanted? Well, your bash process is going to have its own file system view, so yeah, you're going to want, like we do with the file system containers... there isn't... you're going to need a set of tools, and this is what I'm alluding to. Your system is going to have a root content store that contains all of the possible packages everything on your system is going to need, and you're going to want something that's your sort of hypervisor login, whatever:
a standard workstation-type shell, and that one's going to have /bin/bash there in the normal place, your usual "this is how you log in". So that contained image is going to be a standard whatever, but all of your other ones can be funkier if they want to. When you log into your shell, you're going to get the view that your shell defines as what things should be in its file system, but it's not going to get everything for everybody on the system; if you don't use PHP at the command line, you're not going to have PHP in there. There's going to be another set of tools that you use to look at the global, or not global, but your system's content store, to say: oh, okay, I do have this copy of PHP here, or whatever. (Like a classification?) Yes, exactly. Or it's a Git-style content store, where it's just content: you have a package, and your packages are all in some big heap, and when you start a process you pull the right ones out of the heap and run them. Your login can be in that heap, but it's not going to have everything in the heap unless you really want it to; and why would you do that, and there are collisions and stuff, so you can't really do that. But it is an interesting point, because it sort of disrupts the idea of logging into "the system", because there isn't a "the system": each thing gets its own view of the larger whole, of the components that are used by all of them, but there isn't one canonical thing, unless you're directly looking at the store itself. Did that make sense? Yeah? Okay, cool. And I guess that's all I've got time for. Thank you all very much, I really appreciate it. The next talk is going to start in five minutes, so stick around.

[Between talks, the next speaker and the moderator set up slides and sort out logistics: checking the pronunciation of the speaker's name, Waiman Long, who will do the introduction, and how time warnings and questions will be handled.]

Can you guys hear me? At the back? Thank you. So, welcome to DevConf.US. This is the systems engineering track. My name is Yash; I'll be your moderator for this track. The next talk is by Waiman Long, and the title of the talk is "Spectre and Meltdown: A Primer". I'll let him take over. Thanks.

Okay. Good afternoon, ladies and gentlemen. I think you've all heard the names Spectre and Meltdown before; if you haven't heard them, then this is not the right talk for you.
So, basically, in this presentation I'm going to talk about how the Spectre and Meltdown class of security bugs came about, what kinds of vulnerabilities were actually discovered, and what we are doing on the operating system side to mitigate the impact of these problems in the computer chips themselves. The reason why we have this kind of security vulnerability is our quest for ever higher performance in computer chips. In doing so, we took some shortcuts that in the end led to the kind of security vulnerabilities we'll talk about; some people might say we were being too aggressive. And finally, I will talk a little bit about what the computer chip makers are doing in the future to try to mitigate some of the problems we currently have in our current set of CPU chips. So, first of all, what are Spectre and Meltdown? They are a new class of CPU vulnerabilities due to a capability in computer hardware called speculative execution. Up to now there have been a number of different security vulnerabilities in this category; the most relevant ones are as follows. We have Spectre v1, which is called bounds check bypass. We will talk about each of the vulnerabilities in a bit more detail in the following slides; this slide just gives you the set I'm going to talk about. Spectre v2 is branch target injection. Then we have Meltdown, which we internally call variant three. So they all have variant numbers: one, two, three, four, five. After that I'll talk about speculative store bypass, which we internally call variant four. Then I'll also say a little bit about a new one called L1 Terminal Fault, which just came out of embargo three days ago; this is the new security bug people have been talking about for the past few days. Beyond the main ones there are also a couple of variants that I'm not going to talk about, because that would just take too much time: we have Spectre v1.1, called bounds check bypass on stores, or speculative buffer overflow. And then there's another one called NetSpectre, which allows you to perform a speculative attack from the network side, so you don't actually have to run anything on the computer itself: you just send some packets to the computer and then you can extract some information out of it. But the thing with NetSpectre is that the bandwidth is very low: you can extract maybe a few bytes every hour or so. Although this is really possible, the amount of data you can extract is very, very minimal; unless you can keep the attack going for a few weeks or so, you won't be able to get much information out of it. Okay, why do we have this kind of security bug? First of all, I would like to talk about the different levels in the computer memory hierarchy. Within the CPU chip, the fastest memory you have is the registers; you can usually access data in a register within one clock cycle. Beyond that, you have to access data from the caches. In modern computer chips there are usually about three levels of cache: the L1 cache, the L2 cache, and the L3 cache.
Beyond that, you have to access the data from memory. This table shows you the latency, the amount of time you need to access data from the different levels of the memory hierarchy. From a register you need one cycle; from the L1 cache, maybe about four cycles. All these timings depend on the CPU itself; different CPUs may have different timings for the different cache levels, and this is just one example, using a Haswell i7-4770 CPU. For that CPU, the time to access data in the L2 cache is 12 cycles. For the L3 cache it depends on where the data is: the L3 is one single cache shared by all the CPU cores, so it depends on where the core is and which cache slice holds the data, and the timing can vary a little bit; that's why there is a minimum and a maximum, depending on the location. The further the data is from the actual CPU core, the more time you need to access it. And for RAM, for physical memory, you actually need quite a lot of time: we are talking about roughly 100 nanoseconds to get one piece of data from memory. With a 3.4-gigahertz CPU, one clock cycle is about 0.3 nanoseconds. So compare 0.3 nanoseconds with 100 nanoseconds: it's a factor of about 300. You can do one operation in one cycle, and then the next operation needs data from memory, and you have to wait about 300 cycles before you actually get that data. In between, if you have nothing to do, the CPU will sit idle, doing nothing, basically. That's why modern CPUs have a lot of ways to speed up operation. If you listened to the talk in the morning about the inner life of a CPU, the speaker talked a lot about the internal operation of the CPU and the kinds of optimizations that are done to speed things up. So, in order to hide all this memory latency, there are a lot of things the CPU can do. The easiest is pipelining: basically, you break up an instruction into smaller operations, micro-ops; you can break it into four or five or even more, and in each cycle you do one micro-op, and in the next cycle the second one, and you pipeline the whole thing, so you do things in parallel. In essence, you extend the time available to complete each instruction, but at the same time you still maintain a very high clock speed. But there is only so much you can do with pipelining. The second way the CPU can speed up operation is by doing out-of-order execution. In the instruction stream you have the first instruction, the second instruction, and so on; when they are run in the computer, they are not necessarily run in the order of the instruction stream. The chip will analyze the dependencies between instructions, and if it finds that instructions have no dependency between them, it may execute them out of order: a later instruction may get executed before an earlier one, and so on. So it can extract the parallelism in your instruction stream.
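Going back to the latency table for a moment: you can reproduce rough versions of those numbers from user space. The sketch below (my own example) chases a chain of dependent pointers through buffers of increasing size, so each load must wait for the previous one; as the working set falls out of L1, then L2, then L3, the nanoseconds per load step up roughly the way the table describes. Sattolo's shuffle makes the chain a single random cycle so the prefetchers cannot guess ahead:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STEPS 10000000L

    /* Average time of one load in a chain of dependent loads through
       a buffer of n pointers (n * sizeof(size_t) bytes working set). */
    static double ns_per_load(size_t n) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo's shuffle: yields one single cycle over all slots. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec a, b;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long s = 0; s < STEPS; s++)
            p = next[p];             /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (p == (size_t)-1) puts(""); /* keep p live */
        free(next);
        return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / STEPS;
    }

    int main(void) {
        srand(1);
        size_t sizes[] = { 1 << 12, 1 << 15, 1 << 18, 1 << 21, 1 << 24 };
        for (int i = 0; i < 5; i++)
            printf("%9zu KiB working set: %5.1f ns per load\n",
                   sizes[i] * sizeof(size_t) / 1024, ns_per_load(sizes[i]));
        return 0;
    }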
And beyond out-of-order execution, you also have branch prediction, because of what happens whenever you hit a branch instruction. Modern computer software is usually very, very branchy, especially the general-purpose kind: you hit a branch instruction every six or seven instructions. When you hit a branch, the CPU has to evaluate the condition that leads to the branch to see whether you take one path or the other. And depending on what kind of instructions come before the branch, it can take a while to determine whether you are taking the first path or the second one. So most computer hardware has some dedicated logic to do branch prediction: you predict ahead of time which branch you are most likely to take, and then execute those instructions ahead of time.
And then we get to speculative execution. Why is it called speculative execution? Because it usually follows the branch prediction. After the branch, if your prediction hardware predicted correctly, the instructions after the branch can be retired immediately after that. If the prediction is wrong, the CPU has to throw away all the instructions it has executed and all the intermediate data, then go to the other branch and do the execution again. That creates a pipeline stall and slows things down. It usually doesn't happen that often, but when it happens, it slows down the chip.
So all the instructions that were incorrectly predicted get rolled back. That means the architectural state, like the contents of the registers, the contents of memory, the condition flags and so on, will be changed back to its original form, so you won't see any side effects from the mispredicted branch. But there are other, micro-architectural states, like the contents of the cache, that are not rolled back. When you do speculative execution, it's possible that you load some data from memory, and that data gets stored in the cache. Even if, at the end, it's found that the branch was mispredicted, the information in the cache is not reverted; it stays in the cache. And it is these changes in micro-architectural state that lead to all the speculative execution bugs we are talking about.
So now I'll talk about what a side channel is. A side channel is basically an attack based on information gained from the actual implementation of a CPU. As I said before, micro-architectural state, like the contents of the cache, will not be rolled back. So that information is in the cache, and what you can observe is that accessing data that is in the cache is much faster than accessing data that is in memory: for memory, you have to load it first into the cache, and then from the cache into a register, and it takes a longer time. So the most commonly used side channel is timing information, the time you need to access some piece of data. There are other side channels that are possible too. For example, if you are able to monitor the power consumption of a CPU or of a complete system, you can infer whether the CPU is busy or idle, things like that.
You can have some external device that monitors the electromagnetic radiation, or even the sound, coming out of the computer, and then infer some information about what the computer is actually doing. For the Spectre and Meltdown attacks, the side channel used is timing. What you do in order to perform the attack is: before you execute an instruction, you read the current time from the TSC counter; after you execute the instruction, you read the TSC counter again and see how much time has elapsed. From the actual execution time of the instruction, you can infer whether the piece of information you were trying to access was in the cache or in memory. The cache side channel is the most common one used in all these attacks, but other side channels are also possible. In the NetSpectre paper they talk about another side channel, using the time you need to execute an AVX-512 instruction. Those are the new SIMD instructions in the new Intel CPUs, and they are very resource intensive: when you run those instructions, the CPU actually slows down, it reduces the clock speed, because they consume so much power. And usually, when you haven't used those instructions for a while, the CPU turns off the circuitry of the execution units that the AVX instructions require. That means the first time you execute an AVX instruction, you have to turn the circuitry on and warm it up a bit before you can actually run the instruction, so there's latency. What they do is look at the latency in executing those instructions to see whether the execution unit had been turned on before or not. That's another piece of information they can use to infer some of the state within the CPU chip.
Okay, Spectre V1. This was the first one disclosed, in January, I think January 4th. Basically, the gadget is just a simple branch instruction: an if statement, and if the condition is true, you do a memory access. The thing is, this kind of simple code pattern is what we call a Spectre gadget: you can make use of it to access secret information via another array access, B[x]. The way the CPU trains the branch predictor is that it has some kind of internal state table and tracks how often the branch has been taken in the past. If the branch has been taken many times in the past, it assumes that the next time you hit the same instruction, it will take the same path; it uses past history to predict the future outcome. So you can actually train the predictor to believe one path is more likely than the other. The attacker can run a piece of software that trains that particular branch prediction entry to predict a certain path, and then, after training the predictor, it executes the Spectre gadget. So the first thing the attacker does is flush out all the data items of the probe array A.
So none of that data will be in the cache. Now you pass in a value x, say a very big value, far outside the bounds of the array, so the access is way beyond the bounds check. The actual x can be much larger than the length. But if the length parameter happens to need to be fetched from memory, then before the actual length can be read from memory, there is a period of time where the CPU has nothing to do, and speculative execution kicks in. It assumes the branch will be taken, so it goes ahead and reads B[x], gets the secret byte in, and then uses that secret value to access another piece of data, an entry of the probe array A, which gets brought into the cache. And after the branch resolves and fails, you can later load each value of array A one by one and see how much time each takes. Say the secret value is one: during speculative execution, the instruction accesses A[B[x] * 512], which means it touches the element of A at index 512, and that cache line gets fetched into the cache. After the speculative execution, you access each item, A[0], then A[512], then A[1024], and so on, and you determine which one you can read in the shortest time. That index corresponds to the secret value that was actually in B[x], the one you wanted to get. So this is how they infer the secret value: use speculative execution, then look at how much time you need to access each entry of the array via the cache, and see which secret value is the most likely one. And this is the reason it's called bounds check bypass: it actually bypasses the bounds check and uses speculatively executed instructions to get at the secret value.
So, the way to fix this issue: you can insert a kind of lfence between the branch instruction and the array access, to block speculation. An lfence is an instruction whose sole purpose is to stop any speculative execution going forward until the results of all the preceding instructions have been resolved. That way, no speculative execution is allowed past it. Another way to fix the issue is what's called data-dependent index masking. In the Linux kernel, they have a special macro called array_index_nospec that allows you to dereference an array index safely. The way it works is that it makes use of the bound value itself in a computation that produces another index value; because that computation depends on the actual bound used in the check, subsequent instructions will not proceed until the dependency is resolved. That means you have to get the bounds-check value in first, which slows things down slightly, but it blocks the speculative execution from happening. Oh, I see what time it is.
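As a concrete illustration, here is a minimal Python sketch of the gadget shape and the masking idea just described. It is purely schematic: Python exposes no speculative execution, the array names and the 512 stride follow the example above, and the real kernel macro is C and assembly, so treat this as the logic, not a working exploit or the actual implementation.

```python
# Schematic only: the Spectre V1 gadget shape and a branchless index clamp
# in the spirit of the Linux kernel's array_index_nospec() macro.

B_LEN = 16     # public length of array B
STRIDE = 512   # stride from the example above: one probe-array slot per secret value

def gadget(a, b, x):
    # The bounds-check-bypass shape: on real hardware, if the branch is
    # mispredicted, the body may run speculatively even when x >= B_LEN,
    # leaving a[b[x] * STRIDE] in the cache as a measurable side effect.
    if x < B_LEN:
        return a[b[x] * STRIDE]

def array_index_nospec(x, size):
    # Data-dependent masking: (x - size) is negative exactly when x is in
    # bounds; an arithmetic right shift smears the sign bit into an
    # all-ones or all-zeros mask. An out-of-bounds x collapses to index 0,
    # so even a mispredicted branch cannot read an attacker-chosen address.
    mask = (x - size) >> 63        # -1 (all ones) if x < size, else 0
    return x & mask

print(array_index_nospec(7, B_LEN))        # 7  (in bounds: unchanged)
print(array_index_nospec(100_000, B_LEN))  # 0  (out of bounds: clamped)
```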
Okay, the other one is Spectre V2, branch target injection. A lot of CPUs have some kind of indirect branch instruction, where you get the target of the branch from a register or from a memory location, instead of directly branching to a fixed instruction. And that indirect branch target can also be speculated: there is circuitry in the CPU to speed up indirect branches by speculatively executing where the branch target is predicted to be. It's called the BTB, the branch target buffer. And like the branch predictor, this branch prediction unit can be trained, so you can train it to assume a certain branch address. If the branch address happens to be memory in user space, then the CPU will speculatively execute some instructions in user space, which is very dangerous if it happens within the kernel, because the kernel is able to see everything, every piece of data in the memory address space.
There are two ways this has been fixed. One is called retpoline, the return trampoline, which is basically a software technique that simulates an indirect branch by using a return instruction instead. It's a dance around the stack: you manipulate the stack so that the branch is performed via a return instruction. The other way to fix it is to make use of special MSRs, model-specific registers provided by a microcode update, to limit indirect branch prediction. One of them is what we call IBRS, Indirect Branch Restricted Speculation. When you turn on this mode, the indirect branch prediction unit knows that if you were in a different privilege mode, say you were in user mode, did some indirect branches, and then entered the kernel, then in kernel mode it will ignore all the training data from the user-space indirect branches and only use what has been done in kernel space. Sometimes you also first flush the indirect branch prediction data out, so that none of it is used within the kernel.
Okay, let's go to the next one: Meltdown. Meltdown is actually a slightly different kind of speculative attack; it's more of a bug in the CPU itself, actually. For x86, Meltdown affects the Intel CPUs but not AMD. The reason is that the Intel CPUs are a lot more aggressive in terms of speculative execution. Before a memory access happens, you are supposed to check whether you are allowed to access that particular memory address before you go ahead and load the data. But in the case of the Intel CPUs, the protection check was done after they had already started speculatively loading the memory data. In the case of AMD, they do the permission check beforehand, so they don't have this problem; for Intel, apparently, they did it in a different way. The way to mitigate this problem is to limit what you can see in user mode, because in order to execute the Meltdown attack, you have to be able to see the whole kernel address space. To fix that, what we have done, at least in the Linux kernel, and I think in the other OSes like Windows also, is what's called KPTI, kernel page table isolation. When you're in user mode, you only see a limited amount of kernel data, and when you call into the kernel, it switches the page table so that the whole kernel address space is visible.
But that switching of page tables costs time, and that is why all these mitigations slow things down. We have done some benchmarks; it slows performance down by a few percent, depending on exactly what you're doing. Okay. And then there's variant 4, called Speculative Store Bypass. Actually, okay, let me skip this one. This is the latest one, L1 Terminal Fault. Oh, there's an error on the slide. Okay. This problem is actually quite similar to Meltdown; it's again due to Intel doing too aggressive a job on speculative execution. You know about virtual memory: there is a page table to translate virtual addresses to physical addresses, and there is a bit in each page table entry that defines whether the entry is valid or not. If the page table entry isn't valid, it's supposed to just generate a simple page fault. But what Intel does is, even if the page is not valid, it assumes that the address portion of the page table entry corresponds to the physical address of the data page you want to access, and it actually speculatively loads the data. Once the data is loaded into the cache, you can use the cache timing side channel to retrieve it. But that only happens when the data is already in the L1 cache; if the data isn't in the L1 cache, it won't do anything. It's a kind of shortcut they used to speed up performance, presumably because they figured that is mostly where the data will be.
For this one, there are a number of possible mitigations. In the kernel, we use a technique called PTE inversion. For those page table entries that are not present, what is stored in the entry itself is some kind of metadata, and that metadata usually starts from zero. With PTE inversion, instead of starting from zero, we start from the very top, the last possible value, and go down. The reason is that in most cases you won't have a system with as much memory as the architecture allows: most modern x86 CPUs allow a 46-bit physical address space, which corresponds to 64 terabytes, and you won't find a system with 64 terabytes of RAM; the most common you will find is a few terabytes at most. So if you start from the top, the address stored in the page table entry won't match any existing physical memory, and the CPU won't be able to do that speculative load for those addresses.
But the problem is within a VM. Inside a VM, you have guest virtual memory, and you don't trust what the VM is running: the VM may contain a hostile kernel that uses this vulnerability to attack the host. So in order to mitigate this issue, there are two things you can do. When the hypervisor needs to go back into the VM, it has to first flush the contents of the L1 cache. But even that is not enough, because if the CPU supports SMT, you can have two threads running on the same core, and they also share the L1 cache.
So if one thread is running in the VM and another thread is running on the host, the one in the VM can spy on what the other thread, the one on the host, writes into the L1 data cache. So the only sure way to avoid the problem is actually to disable SMT. Okay, I'll stop here, as the time is up. Any questions? I think we have five minutes for questions, right? Three minutes, so we can do maybe three questions; I can just stand here and take the mic. Okay, sure. Questions? Or I can use the three minutes to explain one more thing. Okay. This is actually the last slide: looking forward, the CPU manufacturers are actually trying to fix some of these issues in silicon. Right now, what we have done is mitigate the problems in software, which is kind of a workaround; it's not a permanent solution, because it slows things down and makes things more complicated within the operating system itself. At least for Meltdown and Spectre V2, I think for Intel, the next generation of CPUs, Cascade Lake, will have silicon fixes for those two, and also for the L1TF vulnerability, and for SSB as well. The one that's harder to fix is Spectre V1, because branch prediction is very fundamental to improving the performance of CPUs, and it's pretty hard to fix that in silicon. They may provide some ways to make the vulnerability harder to exploit, but we will still probably need to find all the Spectre gadgets in the code and try to minimize the use of them. And one more thing: since this is a new class of CPU vulnerability, and there are a lot of researchers working in this area, I think there are more to come in the near future. Okay, that is the end of my presentation. Thank you for your time.
Hello, everyone. Can you guys hear me in the back? Yes? Awesome. So welcome to DevConf. This is the system engineering track, and the next talk is going to be Ignoring Alerts, by Sanjay. So I'll hand it over. All right, we'll just give maybe 30 seconds for everyone to settle down, and then we'll start. Okay. So let's start. So I'm Sanjay. This is joint work with Uli Draper, who's right here. And as the title says, it's Ignoring Alerts. To set the problem up, let's look at a simple example. If someone there can't hear me, just raise your hand and I'll speak up. So this is a toy example. Let's say you have some system you're monitoring, and the yellow line is some metric you're measuring. It's very well behaved: it's a sine curve; you can do magical things when you're working with artificial data. And let's say you put in a threshold: you said this value should never exceed three. That's the red line. And of course, this is a toy example, but this happens all the time.
We monitor systems, whether they're rockets, computers, finance; time series data is everywhere, and we monitor it. And this is okay: every time the metric exceeds the threshold, it sends you an email and says, hey, my threshold got exceeded. Then one day this happens: a spike. That doesn't necessarily mean something's wrong. And let's say the system changes, the behavior changes; that doesn't have to be bad. So I'll pick an example from finance. I'm sure many of you have money invested in the S&P 500; you want it to go up. If you had built a model 10 years ago and said, if the value ever reaches, let's say, 1200, let me know, I'll sell, because you think it's a bubble; well, today it's at 22,000 or something like that. It keeps going up; it's not a stationary time series. So you cannot put in hard thresholds that never look at new data. And if you do that here, of course, beyond a certain point you're getting an alert every single second. This happens all the time in monitoring systems: you set a static threshold or static rules, the data changes, the system changes, and then your alerts come in and they don't mean anything, and you ignore them. What's the use of that? So what we would like to do instead is learn from the data itself. As the data changes, as the system changes, can we come up with new rules that are automatically updated? Can we use these rules to monitor what's going on? Can we look for signatures of events that are interesting? They can be bad things; they can be good things. Machine learning is a buzzword nowadays, but many of these techniques, which are 50 years old, 20 years old, are easy to implement. There are libraries out there, they're fast to compute, and some of them work extremely well at finding hidden patterns, patterns that are not obvious to us at first glance.
So these are some applications for our universe. For sysadmins: monitoring the performance of systems, detecting errors, optimizing with respect to various metrics; you want to minimize downtime, maybe save energy; maybe it's some system on a rocket and you don't want to use all your battery power. For programmers, of course: performance analysis, detecting bugs. And in general, in any root cause analysis, we find what happened, we record the bug, but we never use it again. Can I record various aspects of the bugs I found, the root causes I found, and feed them to a system that can then detect new ones? It can say: look, I've seen something like this in the past; it looks like the same thing is happening again. Maybe it is, maybe it's not, but it's a helpful tool. So this is one of the broadest problems you can think of: anomaly detection. An anomaly doesn't have to be something bad. It is something improbable, something unusual, something that wasn't seen before. In this case, you again have some nice yellow curve, it's periodic, but in the third peak you see a slight jump. And that's unusual. Of course, as before, there could be a static threshold, but like we said, if the data changes you might not necessarily catch it. And the goal is to catch that bump. Of course, real systems are more complex; it's never that easy. And there are various ways of doing this. There's something in machine learning called unsupervised learning, which is when you just say: here's some data, look for unusual things. (There's a lot more work that goes into it than just feeding in the data, but that's the idea.)
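To make the "rules that update themselves" idea concrete, here is a minimal sketch of an adaptive threshold: rolling mean plus three sigma instead of a fixed line. The window size, the 3-sigma factor, and the synthetic data are all arbitrary choices for illustration, not anything from the talk's actual system.

```python
# Adaptive alerting sketch: flag points that sit far outside the rolling
# mean of recent behavior, so a slow trend does not cause alert storms.
import numpy as np

def adaptive_alerts(series, window=100, k=3.0):
    alerts = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        if abs(series[i] - mu) > k * sigma:
            alerts.append(i)   # far outside recent behavior
    return alerts

t = np.arange(2000)
data = np.sin(t / 20) + 0.001 * t   # slowly trending sine: a fixed threshold would eventually always fire
data[1500] += 5                     # inject one anomaly
print(adaptive_alerts(data))        # should flag the injected spike; the trend alone does not alarm
```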
Then there's something called supervised learning, where you build a model. The model says: having seen some of this data, what do I expect in the next one week? Let me make that prediction. Now let me see what actually happens today, tomorrow; let the new data come in, and let's compare the two. If they're different (and I'm being imprecise in my language here: when I say different, I mean statistically different, not just that the delta is more than zero), it means one of three things: my model is wrong, something unusual happened in the data, or a combination of both. But that's a very useful signal, where you say: I know how to model this system, I expect certain things to happen, and the thing that happened is really different from what I expected. So let's look at that. This is just an example, right? Even here you can see the orange area increasing. That's the delta between what you actually observe and your model's prediction. Initially, you might say, oh, it's only different by 1%, half a percent, who cares. And then it grows, and maybe it becomes 50%. So at some stage, you of course start caring.
This is another example. I actually don't remember what this data set is. It's airline traffic, from 1920 to 1963, I think. And as you'd expect, it's growing; it looks linear; there's some periodicity there. The blue curve is what we used to build our model, and the yellow one is something that we don't touch at all. It just sits there. In machine learning terms, if you have heard these terms, the blue data is called training data, and the yellow data is called test data. And this is just one example of a model. I'm not going to go into the details of what SARIMA is; the point of this talk is not to give a comprehensive overview of machine learning techniques, or of how to actually build a model and test it, just to show some examples. So the SARIMA model basically looks at the blue curve and learns a few parameters, and it can use those to make future predictions. So you make future predictions, and (the colors look a bit mixed up) the green points are the predictions and the yellow is the original data. That doesn't look that bad, at least visually. You can zoom into the data set, and what you see is the stuff on the left: yellow is what you actually observe, green is what you predicted. SARIMA is a technique that can learn periodicity, and if you look at the first peak here, it looks pretty good. As you go further out in time, there's a shift: it clearly didn't learn exactly the right periodicity, and some delta adds up over time. So there are already a couple of interesting things here. Number one, most models decay in time. Initially they work well, but as time goes on, the system changes a bit and the models don't work that well, so you have to constantly keep retraining them. The second one, of course, is that you can look at the difference, and the difference that grows is just expressing the fact that the model decayed. What you would do is look at, let's say, 1950 to '52, and say: well, if the gap increases to something that's uncomfortable for me, let's say 10%, then I'm going to look at this. So that can be an alerting system in this case. Any questions till now?
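As a sketch of that train/predict/compare loop, here is a seasonal ARIMA example with statsmodels. The airline.csv file of monthly counts and the (1,1,1)x(1,1,1,12) model order are assumptions for illustration, not the talk's actual setup.

```python
# Train on old data, forecast the held-out tail, alert when the relative
# gap between prediction and reality grows past a comfort level.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

series = pd.read_csv("airline.csv", index_col=0, parse_dates=True).squeeze()
series = series.asfreq("MS")                 # assume a clean monthly series

train, test = series[:-24], series[-24:]     # hold out the last two years

model = SARIMAX(train, order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12)).fit(disp=False)
forecast = model.forecast(steps=len(test))

# The "orange area": the gap between prediction and observation. When the
# relative gap exceeds, say, 10%, that's the alert condition.
gap = (test - forecast).abs() / test
print(gap[gap > 0.10])
```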
So for this talk, we wrote two simulators. Uli, with all his knowledge, wrote two artificial simulators. One simulates a network: you have a bunch of hosts, some belonging to your intranet, some external hosts on the internet. They have some hard-coded patterns, some distributions that we put in for the rates at which they connect to your machines, and you introduce a network topology. And we want to detect intrusions. Since it's a simulation, you can go in by hand, put in some hostile hosts, let them do something bad, and see if you can detect them in the data. The second one is process scheduling. The idea is, let's say you have a simple computer: there are N CPUs and M IO units, and you have some processes (I'll go into the details of how we define a process). You run a process in isolation and you see how many CPU cycles it spends, how many IO cycles. You run another one, which is completely different. What you would like to do is run both of them on some given hardware, packing them maximally. By packing, I mean you want to utilize both the CPU and the IO units as much as possible without penalizing any single process a lot; you don't want their runtimes to double. So your clients are happy, but you're using all your resources. A trivial example: let's say you have two processes, each of which does 50 CPU cycles and 50 IO cycles, but at alternate times, one CPU, one IO. I can pack them perfectly: when my sister process is doing a CPU operation, I do an IO one, and then we flip. That would be perfect packing, but of course, in general, that's not possible.
So, back to the network problem. And again, as I said before, this is not comprehensive at all, so I'll just show a bunch of examples. Some are interesting techniques; some are very simple techniques. The simplest thing you can do with a network is to represent it as a graph: every host is a node, and if two hosts connect, there's an edge. So you see all the hosts and the edges. Generally, these are directed graphs: I might connect to you, I might send packets to you, and you might not. You can of course also think of it as an undirected graph, where there are no directions on the arrows; you just have an edge. And of course, you can plot graphs in various ways. This is something that's called a spring layout; let's not worry about that. The whole point is to take each node, map it to x, y coordinates, and plot it on a 2D plot. And at least in this case, you can naturally see five clusters, five groupings: you see the four fans, and you see something in the center. All the (I don't even know what to call that color) pink, salmon nodes are external hosts, so they are on the internet. All the blue ones are on the intranet, connecting to those internet hosts, so they are internet-facing hosts. And all the green and yellow ones are back-end and control machines; they're your infrastructure. So now, what do you do with the graph? I can look at it and see something, but I don't want to look at 10,000 graphs a day; I want to do something in an automated way. And there's a very nice correspondence between graphs and matrices: you can represent those graphs (by graphs I always mean nodes and edges) by matrices. The simplest matrix is the adjacency matrix. And the idea is, if you have n nodes, you give a number to each node, from 1 to n, and then you create an n-by-n matrix.
And if node i and node j, host i and host j, talk to each other, you put a 1; if they don't, you put a 0. So now you get an n-by-n matrix of 0s and 1s, which is a very flexible tool. Because maybe you don't want to put just 1s: maybe each edge has a weight, the number of connections per day, and I can put that number into the matrix. Maybe it's an undirected graph, in which case the matrix is symmetric. So this, in general, is a very powerful correspondence. And then there are derived matrices. There's something called the Laplacian. We won't go into it, but what it does is let you explore complex graphs through the language of linear algebra, so you can look at eigenvectors and eigenvalues, if you have worked with those kinds of things. I'll state what sounds like a mysterious result. There's something called the Laplacian of a graph; it comes from the adjacency matrix, so think of it as another matrix. You can compute the eigenvalues and eigenvectors of this matrix, and if you look at the number of zero eigenvalues, that tells you how many connected components the graph has. You can similarly compute all kinds of interesting graph properties by looking at the eigenvalues of this matrix. You can look at its eigenvectors and visualize the graph in different ways. And again, I won't go into details here; if you are interested, please look for Uli or me after the talk, and we'd be happy to talk about it. But the key takeaway point here is: given a graph, there are interesting matrices, and you can use linear algebra to compute useful things, and they let you monitor your graph. And you have to put in some nice pictures: this is an adjacency matrix heat map. What you see is that some of the rows and columns light up; each row and column corresponds to a certain node. And so by looking at this, you can already see something, like there's a group here which doesn't really talk to each other, and if you see a row that lights up, you can say, okay, that host is talking to a lot of different hosts. As a picture, it's interesting, but not that useful. The linear algebra really makes it useful.
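As a small demonstration of that "mysterious result", here is a toy sketch in numpy, with a made-up five-host adjacency matrix rather than the talk's simulated network: the number of (near-)zero eigenvalues of the graph Laplacian equals the number of connected components.

```python
# Laplacian of a toy graph: hosts {0,1,2} form one component, {3,4} another.
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # adjacency matrix

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

eigenvalues = np.linalg.eigvalsh(L)            # symmetric, so eigvalsh
num_components = int(np.sum(np.isclose(eigenvalues, 0.0)))
print(eigenvalues)           # two (near-)zero eigenvalues
print(num_components)        # -> 2 connected components
```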
So going back to the original graph visualization: we had five clusters, and as humans, we see them instantly. But imagine a very large graph, and it's complex, and you don't want to do this manually. What you would like is an algorithm that says: these are the five groupings, or ten groupings, I see in my graph. And for each one, can I find something unusual? And there's a technique for that called clustering, and clustering by itself has many variants. What I'll show here is called k-means, and rather than talk about it, it's easier to show pictures. So again, we have a 2D plot, and we have three clusters. There is a downside: you have to tell it up front how many clusters you think there are. So here we'll say three. (Of course, there are automated ways of scanning over the number of clusters and finding the best one, but here, let's say we pick three.) I don't know if it's easy to see, but now there are three centers: you see diamonds, blue, purple, and green. And what the algorithm does is it says: I don't know where these clusters are, so I'm going to drop three random points. So you drop them. Then every single point in your data votes; it says which of these three points is closest to me. And if it's the blue one, then it colors itself blue; all the ones that are closest to the blue one set themselves blue, and similarly for purple and green. And then the second step is you take all the blue points and find their mean: take their x positions, find the mean, take the y positions, find the mean or the average, and that corrects the center's position. So there are two steps, right? You drop three random points. Every data point votes; it says, I'm closest to that guy. All of us who are blue take our average and move our center. And then we repeat. Again, we all vote; we say who's closest to which diamond. And so now you get something slightly different, and then you compute the mean again. And I'll just run through this a couple more times. As you see, it converges. What happens is the centers keep moving until each center is basically inside its cluster. Now I'll throw a couple of buzzwords out here too. There's a whole family of techniques called expectation maximization; this is one of them. In practice, you have to run this 20 times, n times, and take the best solution. But again, that's super easy to do; you just run it a few extra times. Another thing, if you're interested: k-means is good for spherical clusters. There are other techniques, called spectral clustering, for example, or DBSCAN, which are good for other things. But this, in general, will automatically scan over your data set and tell you what the clusters are. And just to make sure: this is an illustrative data set in two dimensions, and you might say it's obvious. What we are working on normally are data sets with hundreds or thousands of dimensions. (Anyone here who can imagine more than three dimensions, please see me.)
And then we do what we were talking about earlier. It's hard to show this, but we actually did this on the simulator: you can put in your hostile hosts, and they just pop out, because they have a different connectivity pattern; they connect to more hosts, or they connect more often. But what you do is you look at each cluster and you say: who's in the minority? So in this case, you have the cluster ID, you have some host address, and you take some property of the host. This is network-specific; you can do this for other things. And you say, maybe the property I'm looking at is who is internal, who is external? And you look for the minority. So in other words, we were talking about anomaly detection, which is looking for something improbable. You can use k-means for anomaly detection. You have a cloud of data in three, four, ten dimensions, in our case two dimensions. Do clustering, look at each cluster, and say: who's in the minority in my cluster? And those are the anomalies. So, same point: those are the anomalies, and in our case, at least, the hostile hosts just pop out.
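Here is a minimal k-means sketch with scikit-learn. The data is synthetic, and instead of the talk's minority-by-property rule it uses a simpler variant of the same idea (flag the points that sit farthest from their own cluster center), so treat it as an illustration of the shape of the approach, not the speakers' actual code.

```python
# Toy k-means anomaly detection: cluster the points, then flag the points
# farthest from their assigned cluster center as anomaly candidates.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(300, 2)) for c in centers])
X = np.vstack([X, [[2.5, 2.5], [6.5, 1.0]]])   # two injected "hostile hosts"

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of every point to the center it was assigned to.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(np.argsort(dist)[-2:])   # usually the indices of the two injected points
```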
This next one, again, has no machine learning at all; this is just visualization. What this is, is a graph where the y-axis is time, so time goes up. You have your graph, you have all these hosts talking to each other, and at every point in time, you make one of these horizontal lines. What the horizontal lines are is the traffic from external hosts to my internal hosts. There are four internal hosts that are internet-facing, and those are the chunks you see: this purple patch is one internal host, and the same goes for the three orangish patches. And what you see is traffic as a function of time as you go up. The reason the first patch is purple is that this one host connects to twice as many clients but gets half the data from each, so it gets less traffic; those other three get statistically the same amount of traffic. And you clearly see some bands that stand out: here, here, here, and the last one. And again, there's no machine learning here at all. It's just visualizing the traffic data, every single incoming connection for every internal host, as a function of time, and it already pops out. So sometimes just visualizing your data in a simple way, or in multiple ways, is super helpful. This is another one, not a very pretty plot, but the x-axis is again time, and the y-axis is traffic. And here you see, this is from the analysis we were doing: you see two hostile machines. The green one is a scanner, so it's scanning ports. The red one then looks at a potential target and attacks it, so it has less traffic. And again, this is a simulation; it's cleaner than real life is. But you can do another interesting thing with this, which is to project it into two dimensions.
So the next technique we'll talk about is taking data in high dimensions. And by high dimensions, it simply means that instead of having two numbers per data point, you have 20 numbers; that's 20 dimensions. How do I visualize this in two or three dimensions? That's again a whole subfield, but we'll look at one technique, the simplest one, and it has a fancy name: principal component analysis. So let's look at an example. You have some three-dimensional data, all these dots. Each one is one data point with three coordinates, and they have different colors. And you can look at this from different angles: you have three different views where you just rotate this thing and look at it, and you see if you can find some way to separate the colors. But you don't, right? In one view, the green and the red overlap; in another, something else overlaps. So what you would like to do is basically find the right angle to look at it from, so that you can separate this data as much as possible. Or in other words, at least in three dimensions, you want to find a two-dimensional plane such that if I project everything onto this plane (and by projecting, I mean just this operation: drop everything perpendicularly onto the plane), I get a picture like this. And I want this picture to have maximum variance, or variation; I want the data to spread out. So the question is: what plane do I pick? And this doesn't work just in three dimensions: you can have 100-dimensional data and want to find a 99-dimensional plane to project it onto. And one of the techniques is principal component analysis; it finds that plane for you. I don't want to make this more mysterious than it sounds: finding the plane is simply finding the eigenvectors of a certain matrix connected to the data. It's the covariance matrix. You find the eigenvectors; that's the plane. But don't worry about that. Basically, when you then take that plane, rotate it, and flatten it onto a two-dimensional plot, you see something like this. Now, again, sometimes it works, sometimes it doesn't, but it's a very powerful technique. So in the case of our traffic analysis, when you plot it in two dimensions, all the external hosts that are not hostile form those two blue blobs. And the scanner and the attacker (there's only one of each) fall in completely separate places. And this is very powerful.
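A minimal PCA sketch with scikit-learn, on synthetic "traffic features" rather than the simulator's real output; the 20 features and the injected odd host are assumptions for illustration.

```python
# Project high-dimensional points onto the 2-D plane of maximum variance
# and look at what separates.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
traffic = rng.normal(size=(200, 20))   # 200 hosts, 20 traffic features each
traffic[0] += 8                        # one host behaving very differently

projected = PCA(n_components=2).fit_transform(traffic)
print(projected[0])      # the odd host lands far from the main cloud
print(projected[1:5])    # ordinary hosts stay bunched together
```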
It's very hard to visualize time series data. I mean, you might have 2,000 data points per time series, and it's hard to look at that. Doing something like this summarizes each series in one point and instantly tells you what's unusual. And again, a few more buzzwords: there are other techniques like this. There's something called a self-organizing map, there's t-SNE, and many others. But PCA is a good go-to technique. How much time? We have to break after that. Okay.
So the second simulator we tried was the process simulation. And the idea here is: let's write a simulator where you have N CPU units, M IO units, and your processes running. How do you model processes? Well, they have four states: they compute, they do IO, and if they can't find enough CPU or IO units, they're waiting, or they're idle. So you have these four states. And what you do is you pick a state. If I'm a process, I pick a state, say compute, and I pick a random number from a distribution (it can be normal, it can be something you pick), and I stay in that state for that amount of time. So I might stay in compute for 50 cycles. Then there's a probability for me to jump from compute to IO, to idle, to something else. And I repeat this. So this is technically what's called a Markov chain. You start in some state, you stay there for some time, that time is randomly generated, and when that time is done, you jump to some other state randomly. It's a very simple model of a computer. Very basic. And like I said before, what we want to model is, let's pick a simple example: two machines, one CPU, one IO unit each. I run a process in isolation on, let's say, each of them. I measure the number of IO cycles, CPU cycles, total time. And I want to find out what happens: can I predict what would happen if I run them together on the same hardware? Can I pack them in a tighter fashion?
And so this is the raw data that we generated. On the x-axis, you run each process; if you pick two processes, you run each 100 times (there are random elements, so you have to average them). The x-axis is the mean compute time plus mean IO time: time spent in the compute state plus time spent in the IO state, or in other words, time spent doing useful work, for both processes combined. And on the y-axis is: if you put both of these processes on the same machine, how many wait cycles are introduced? So the worst-case scenario is if both processes want their 50 CPU and 50 IO cycles at exactly the same moments: we go on the same machine, and we run 400 cycles in all, because we just can't multitask. So that's what you see on the y-axis. So again, the x-axis is total work in isolation for each process; the y-axis is, when I put them together, how many wait cycles do I introduce? Ideally, that's zero, but it's not. And again, with the caveat that this is a simulation, there are already some interesting features. You see a big blob on the right side, where the processes take a long time to run, but they don't have that many wait cycles, so they seem to pack pretty well. While over here, it's almost linear. So there's some interesting pattern here, but we would like to predict the y.
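Here is a toy version of that four-state process model. All the dwell times and transition probabilities below are invented for illustration; the talk's simulator presumably used its own distributions.

```python
# A Markov chain over four process states with a random dwell time in each.
import numpy as np

STATES = ["compute", "io", "wait", "idle"]
# TRANSITION[i][j]: probability of jumping from state i to state j
TRANSITION = np.array([[0.0, 0.6, 0.2, 0.2],
                       [0.7, 0.0, 0.2, 0.1],
                       [0.5, 0.4, 0.0, 0.1],
                       [0.5, 0.4, 0.1, 0.0]])
DWELL_MEAN = {"compute": 50, "io": 30, "wait": 10, "idle": 5}   # in cycles

def simulate(total_cycles, seed=0):
    rng = np.random.default_rng(seed)
    state, elapsed = 0, 0
    spent = {s: 0 for s in STATES}
    while elapsed < total_cycles:
        name = STATES[state]
        # Stay in the current state for a random dwell time, then jump.
        dwell = max(1, int(rng.normal(DWELL_MEAN[name], DWELL_MEAN[name] / 4)))
        spent[name] += dwell
        elapsed += dwell
        state = rng.choice(4, p=TRANSITION[state])
    return spent

print(simulate(10_000))   # cycles spent in each of the four states
```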
And so one of the techniques for predicting that, a very useful, very fast, and powerful technique in machine learning, is called a random forest, and it's basically a combination of something called decision trees. What is a decision tree? It's a sequence of if-else statements. You give me some data, x, y, z, which are so-called features, and you tell me... so let me talk about this example. You give me two coordinates, x and y. Every point, depending on where it lands, is red or green: if it's in the center, it's red; if it's outside, it's green. I want to build a sequence of rules so that if you give me a new point, x and y, I can tell you whether it's red or green. Now, of course, this is visually a trivial example. You could say, why not just draw a circle and see where the point lies. But again, like Uli said, data can be more complex; it can be high dimensional. This is just an example. And what the decision tree does is it constructs a tree like that, just if-else sequences, where the thresholds in the statements are learned from the data. And what the plot on the right shows is: you just go across the grid, and everything that would be predicted as green is in yellow, and everything that would be predicted as red is in blue. And of course, because it's if-else statements on x and y, you get vertical and horizontal boundary lines. And we are not claiming this is the best technique to use for time series or for regression, but it's one of the first go-to techniques.
Just to give you a flavor of how messy this gets: we actually took the raw data, the process data, and constructed 123 different properties. Time spent in compute, number of transitions from compute to IO; then you take the time spent in compute and you bucket it into bins. So you try all these things, you feed it to your algorithm, and this is what you get. On the training set (you take some data and you say, my algorithm only sees that), you say: all right, I'm 9% off on average; the error between the prediction and the actual values is roughly 9%. But when you actually test it on data it hasn't seen, you get 50% errors. So, number one, this is something called overfitting. This is severe overfitting: your model is memorizing the training data it saw, then it sees something different, and it conks out. But more than that, it's just pretty bad. Then you reduce the data you use. You don't use 123 properties of each process; you use two, two for each, so four in all. And then, as you see, on the training set, your error goes up. It is still overfitting, because there's a big difference between train and test, but your accuracy is much better. 20% isn't that bad: if I could predict, on average with a 20% error, the total runtime when you run N processes together, I would be happy. I mean, that's a rough, crude model to help you schedule a distributed system in general. And so the key takeaway point here is: one, of course, there are many techniques like decision trees and random forests. But the second one is, more data is not always better. Sometimes having the right data, which is where domain experts come in, is far more powerful than just throwing data at something.
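A sketch of that overfit-then-trim story with scikit-learn's RandomForestRegressor, on synthetic data standing in for the 123 process properties; by construction, only two features actually matter here.

```python
# Many noisy features overfit; keeping only the informative ones usually
# narrows the train/test gap, echoing "more data is not always better".
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 123))                        # 123 mostly-noise "properties"
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)   # only two features matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("train error:", mean_absolute_error(y_tr, rf.predict(X_tr)))   # small
print("test error: ", mean_absolute_error(y_te, rf.predict(X_te)))   # much larger: overfitting

rf2 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr[:, :2], y_tr)
print("test, 2 features:", mean_absolute_error(y_te, rf2.predict(X_te[:, :2])))  # usually better
```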
And so lastly, I'll just show you these slightly messy plots. On the left is the model that uses all the data, the 123 properties of each process. The blue points are the original data; the red ones are the predictions. And you can see it's really off here: each prediction should sit on a vertical line from its data point, but this point was predicted to be way over here; even here, you're completely off; all this stuff should be over here. What you see on the right side is the model with just four properties in all. And of course, it's hard to know which one's better just by looking, except you can look at specific corners and say, oh, it's doing better in the right corner; there are still some bad things happening here. But this is the one that gave you better performance on average. So that's it. Again, the key takeaway from this is not that this is the most accurate model or the most accurate simulator, but that it is useful to use some of these machine learning techniques, and there are many to choose from, to help you automatically monitor systems.
So the thing about this is that the techniques we have deployed, and what we did there, are deliberately not specific to this purpose. These are general-purpose techniques, and you collect the data anyway. So what we're suggesting is that instead of trying to guess what is going on, you try out some algorithms like this and let the algorithm do some of the choosing. It's not a dumb process: you don't just throw all the data you have at it; that's what this last example was supposed to show. Instead, you try out a couple of things, and then, once you've figured out a model, you can just retrain it over and over again. Usually, if the complexity of the model is not too high, so the learning doesn't take too long, you can relearn the parameters every single day. Then you're catching up on trends: remember some of the first slides, where you have a trend in the data, so you want to keep up with that. These kinds of things work well as long as you're not putting your own bias into the game, as long as you drop the idea that you know everything about the system. You should know something about what is important, but you should not claim to know that, say, this network traffic will never exceed this value. Just imagine: back in the day, we said, well, if you ever have more than five megabits of network traffic in a second, something is wrong. Well, guess what? Nowadays, we have 100-gigabit systems. So if you put systems like this in place, they will automatically learn. And there are many techniques, so ask your friendly neighborhood machine learning expert about these kinds of things; they will probably be happy to help you along with this. Any questions?
So just for reference: Sanjay and I used to give classes in machine learning, and the very, very basic overview class on all the techniques alone took something like 40 hours, and that's without even going into details. So it's not something you'll learn overnight; that's more like a three- or four-year process for you to catch up. So don't expect that you'll be able to do these kinds of things yourself right away, but you can find someone who actually can help you with that. All right. So thank you, everyone. We have a 20-minute coffee break now, and the next talk is at 3:50. So thank you.
All right. Okay. I'm Peter Robinson.
I work for Red Hat on IoT and all sorts of various other bits and pieces. My co-presenter, hiding over to the side here, can introduce himself, and actually get into the line of the camera.
Yeah, hello. I'm Robert Wolfe, 96Boards community manager for Linaro.
Yeah, so we're just going to talk about AI and machine learning using Fedora as an operating system and stack on the 96Boards AI platforms. So Robert will give a brief overview of the 96Boards ecosystem. We'll cover some of the various hardware for AI and machine learning, how we're going to do AI and machine learning on Fedora, some of the pros and cons and issues we are facing in doing that, and how we're going to try and tie it all together between the 96Boards AI initiative and Fedora as a whole, to give a cohesive sort of end-user experience across a whole range of hardware.
Great, yeah. So as Peter mentioned, I'm going to first talk about 96Boards, the ecosystem in itself, and then how it relates to Fedora and how we're going to move forward with 96Boards and Fedora. So, the 96Boards ecosystem. I'm guessing at least some of you have heard of what 96Boards is, but basically, it's a single board computer, or rather an open hardware specification. I'm going to quote this right here: the hardware spec is open, not the hardware itself, right? So it's an open hardware specification, and I'm going to talk a little bit more about that in just a second. But first, I want to give a little bit of history of what Linaro is and what 96Boards is, and why we started 96Boards. So Linaro, as you can see there, was founded in 2010, and was originally created to reduce the redundancies and fragmentation in the Linux-on-ARM ecosystem. And there was a big problem that they faced in this five-year gap, which was basically trying to provide the development hardware, the platform for the developers to work on. And back in the day, not even that long ago, trying to get your hands on this ARM-based hardware, first of all, took a long time, and second of all, it was very expensive. And so a bunch of big companies got together and said: let's design a specification, let's call it 96Boards, and we're going to make it easier and cheaper for people to get their hands on the hardware they need to develop on. That's when 96Boards was born. And so you have a specification, an SoC-agnostic specification, that allows vendors to come together and build boards, so that you can develop on them and do what you need to do for a reasonable price.
All right, so as I mentioned, right now we have three specs: you have the Consumer Edition, the Enterprise Edition, and the IoT Edition. Right now, we're trying to work on a strong hardware and software story. This is basically creating the hardware layer for you to develop on; the operating system, which is where we're working with Fedora, with Peter right here, to get a strong software story at the operating-system level; and then, after that, comes the application level, so that you can develop on multiple systems-on-chip without really feeling too much of a difference when you're transitioning your development. Our model is a partner-based model, so for 96Boards as a whole, you have a bunch of partners that come together, industry partners, and they are the ones who decide which direction this takes.
Peter, in fact, is one of those steering committee members, representing Fedora and Red Hat on the board there, but there are a lot of members, as you can see over there.
Yeah, so I represent the Red Hat ecosystem: primarily, at the moment, Fedora, RHEL, CentOS, primarily more the community distributions. And one of the things that Fedora in particular provides is that, across the boards that we currently support in Fedora (and there are going to be a bunch of extra ones coming along in F29), it's the same kernel and the same experience that you get on x86: the same kernel, SELinux, and various other bits and pieces. So it's just a unified experience; it doesn't matter what device you're running on.
Yeah, it's a big deal, because you have partners from all different branches of the industry all coming together to try and figure out the best way to build this hardware and bring it to the developers. So it is not just one person saying "this is how I think it should be"; it's people coming together and saying "this is what we think it should be" and then testing it out. Within the 96Boards ecosystem, we have a pretty vibrant and growing community. 96Boards isn't that old, but we are seeing a lot of traction in the community, and it is growing. So if you do have any questions, or would like to find a way to get hold of us, or want to know how to get involved in the community, you can reach out to me afterwards.
Initiatives: this is one of the ways that we push forward with 96Boards, kind of launching initiatives. And since we are a partner-based organization, we seek a lot of help from our partners: reaching out to Red Hat, Qualcomm, Xilinx, Avnet, Arrow, basically every company that's involved with us. We try launching initiatives in parallel with these other companies, so we kind of push forward together. One of these initiatives, for example, is our mezzanine community. This is an open-source repository that we pushed out there that allows people to grab all of these design files, whether you're building in KiCad, Eagle, Altium, or a few others; we put the templates out there so that you can then build add-on devices. And of course, 96Boards AI is another one of those initiatives, and I'm actually going to talk about that right now. So with 96Boards AI, we are basically creating a compartmentalized section of 96Boards where we're saying: these boards, we think, are the best for AI. And so we basically chose a few; right now we're calling them AI-compliant for this purpose, and we want to create the ultimate software, hardware, and application story for these boards. Now, on the top, you have the Rock960; in the middle, you have the Xilinx one, which is the Ultra96. It's this one here: a quad-core 64-bit processor, a fairly high-end FPGA, and a couple of real-time-capable co-processors on board as well, all in something about the size of a credit card. So it's fairly powerful, fairly small, literally pocket-sized; it's a cool little device. And as you can see, looking at the spec, these boards all have the same footprint, and most of the things where companies would, you could say, be reinventing the wheel or being redundant in their development or in their IP, you don't really have to focus on: the general IO, the USB ports, HDMI. You can create your niche on this footprint without having to worry about too much IP. Right? So here, for instance, if you take the standard Consumer Edition board, you can see it kind of replicated right here
on the bottom half of this one — this is an extended version of the Consumer Edition board. So this is one attempt at reaching out to AI, and it's what we're going to focus on the most, but 96Boards is trying to tackle other verticals too; this is just one of them. So with that, I think that's the last of my slides; we can talk more about 96Boards afterwards if you have more questions.

Yeah, so in Fedora we're obviously working closely with the 96Boards guys. There are also others — some of the NVIDIA Jetson ones we're looking at, and GPGPU as well. For actual AI and machine learning hardware there are four main categories. There's the FPGA stuff: the Xilinx parts, a number of other boards, Intel's Altera; the Lattice iCE40 is fairly popular. There's GPGPU, led primarily by NVIDIA with the CUDA framework, though obviously there are other ones that support the OpenCL standard. There's a new category, which we're still investigating — the Rockchip and HiSilicon Kirin 970 parts have on-board neural processing units — and then Qualcomm has a DSP which also provides a neural processing engine. So there are a number of different hardware categories in there that we're working to support well in Fedora; some of them are currently supported in the kernel and toolchains a little bit better than others.

AI and machine learning has a number of different stacks. There's obviously TensorFlow, which came out of Google originally and is getting a lot of traction, but there's Caffe, Tengine, Torch and numerous other high-level stacks. Then there's a bunch of low-level toolchains, and these are currently widely variable: the Lattice iCE40 that I mentioned earlier has quite a good open source toolchain; the Xilinx stuff is a bit variable at the moment, although the open source side of that is evolving very quickly; some of the neural processing engines I'm still investigating because a bunch of them are quite new on the market. As a result, this is all changing quite quickly, and the hardware/software interface is evolving too — TensorFlow, for example, is very CUDA-oriented currently, although a number of organizations are trying to make it less platform-specific.

If I could add on that — of course you can — so in the FPGA space right now, Xilinx in particular are working on an SDK called SDSoC, and with that they're trying to bring the learning curve down for those of you who are interested in developing on FPGAs. In particular, for the Ultra96 board that Peter has right here, you'll have opportunities to check out this SDK, and in fact if you buy the Ultra96 you get a year's worth of the license. It is a closed license, but it is still a very interesting tool: basically it allows you to code in a language that you're more comfortable with, and then it compiles it for the FPGA, so you can actually work on the hardware without being a full-blown FPGA developer. So it's pretty interesting.

And then, how does this very wide and varied ecosystem come together with 96Boards and Fedora? We're working together, along with some others, to get a unified experience. At the moment some of the devices come with Android on them, some come with very custom distros of Linux that don't support standardized things that a lot of people are coming to expect — like containers — and various other bits and
pieces, wildly varying versions of the kernel that are generally full of CVEs and, especially with some of the recent Spectre and Meltdown stuff, very much not up to date. So we're working to provide a unified experience, so that no matter which board you have, no matter which machine learning hardware options are available, you'll be able to get Fedora on there, you'll be able to run containers exactly as you would on x86, and then access the FPGA or the GPU or whatever — essentially be able to eventually install TensorFlow and have it work on the underlying units without a difference in experience. So: a wide variety of AI and machine learning hardware, but one OS, and like everything else in the Fedora ecosystem you'll be able to run Docker and containers or whatever else on top of that in exactly the same way as you would on other architectures, with all the expected security-level things like SELinux and seccomp — basically a unified experience. It's going to take us a while to get there: in Fedora 29 we've got initial support at the kernel level for things like the FPGA manager and various other mechanisms, and it's going to be a bit of a long road with lots of work to do, but it's starting with Fedora 29. As things evolve there's the IceStorm stack, I think it's called, which will be usable with the iCE40 FPGAs, and as the open source tools evolve it should, in the next year or so, become a relatively nice, straightforward experience no matter what the underlying hardware is.

So, does anyone have any questions? Stephen: can you give any examples of projects or initiatives that are working with this in Fedora today, or soon, and the specific goals they're trying to achieve with it?

So yeah, there are a number of different IoT cases — I started to get involved in some of this because I have IoT people who are interested in it, working in different industries. I also had a chat at Flock last week with someone from Amazon about their Greengrass project, which is a project that runs locally, using Amazon technologies, for local AI and machine learning — so that's an IoT gateway use case that is happening. There are a number of industries within the IoT space that are very interested in FPGAs for local AI at the edge, so that's primarily my focus. The Ultra96 was demoed at the last Connect running real-time number-plate and road-sign recognition, and it was processing thousands and thousands of signs a second. I was hoping to be able to demo it, but we didn't really have enough time to do a talk and a demo in 25–30 odd minutes; I've been planning on getting that up and running so people can deploy it with Ansible, as a sort of standard thing, on the Xilinx FPGA. So there's a lot of interest in a lot of different areas for that sort of stuff.

And I'll add that there's also an application layer being built on top of this: at 96Boards we've been talking with Mozilla and their Mozilla IoT initiative, and we're hopefully going to be working with them to try enabling the Mozilla IoT gateway across all of this hardware using Fedora. That's another example of the application layer taking advantage of this unification across the OS and across the hardware — it doesn't necessarily use the FPGA stuff at the moment, but there are other projects that do: the next generation of the Mycroft hardware is going to use the Xilinx FPGA, so we'll be able to — my intention is that we can run that on Fedora, and that will be used for things like voice recognition.
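As a concrete sketch of that "same experience as x86" claim, this is roughly what it is supposed to look like on one of these boards once Fedora is on it — the image tag and package names here are illustrative, not a tested recipe:

  # On a 96Boards device running Fedora (hypothetical session):
  sudo dnf install -y podman
  podman run --rm registry.fedoraproject.org/fedora:29 uname -m   # should print aarch64
  getenforce                                                      # SELinux enforcing, just like on x86_64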
Hi — you mentioned writing in a programming language you're comfortable with, and then the software compiles it for the FPGA. I just wanted to know how comparable the execution time of that program is to an FPGA-specific language — if I had written the same program in something FPGA-specific. So I think the question is: how comparable is the speed when you program it through their SDK versus doing it at the low level, with Verilog? I don't actually know the answer to that. Nor do I — that is a Xilinx tool, and I'm not sure of the exact details of their toolchain. From my understanding, though, for most applications, from when I hear them talking, it's not noticeable unless you go really deep, so you wouldn't notice it that much.

Yeah — I'm still trying to understand why I would want to do ML on these little boards. Is it because you're envisioning a future where maybe they have sensors and such and they want to do local compute, train models locally — is that the idea? So, I don't know the specs of this FPGA off-hand, but it's a fairly high-end FPGA with basically some Arm compute attached to it, so you can run a Linux distribution to do the generic stuff and hand the specifics over to the FPGA. It's also got two Cortex-R cores on it, so you can run an RTOS at full real time on it. And the idea is that you could embed this in a light pole with a camera attached, to do real-time machine learning on the edge without having to push back into the cloud. You may have a low-bandwidth link, like LoRa or a 64K-style link, where you don't have the ability to push tens of megabits a second of data up into the cloud to process it, so you do it locally. In a lot of cases you may not have the connectivity at all — out in the field, a ship in the middle of the ocean — and this is fairly low power in terms of actual power draw, but it's high-end processing that you can offload to, to do extensive stuff locally.

It's also really important to note that, as developers, maybe you're not so much interested in the path to product, but there is a strong path to product with these boards. You can work hand in hand with Xilinx and Avnet, the folks that made this board, to do a chip-down design and get your own product out there — once you get up to a certain number of units it's no longer worth it to put this board in your end product; you're going to want to get rid of some USB ports and save a penny here and a penny there, and the next thing you know this is your tool for your path to product. So it's a development device, yes, but it's also relatively standard: it's useful for developers to put one on their desk so they can play around with proofs of concept at little cost, using standard tools and a standard distro to get things up and running — because basically you can "dnf install" just as you would on an x86 server or a VM running in the cloud.

Any more questions? No? Okay, thank you very much. Thank you.

[Speaker change; a few minutes of A/V and screen-mirroring setup before the next talk.]
So, okay — yeah, I was just trying to switch it over so I could have my speaker notes instead of having them on screen for other folks. It's mirroring, and if you extend the screen it doesn't record — we're recording the desktop too — so I'll go by memory, it's fine. Let's see what happens. Oh, I don't have Wi-Fi on my laptop yet; I haven't set that up and I don't know what the password is. You know what, I'm just going to wing it, it's fine.

Ladies and gentlemen: Will Woods. Hey, hi. Absolutely not — Adam, please leave. Hi, I'm Will Woods, I'm a senior software engineer — hey, I should update that slide, I'm senior now — at Red Hat, and I work on things involving installation and packaging, and I say a lot of cuss words about RPM. Half the people in this room have heard me rant about these things before, and you're probably tired of it, and too bad, I'm doing it again. So this is RPM scriptlets, because I think we need to — is it — do I just need to actually talk at it? Hello? Oh, fine. Oh no, too much. Too much, Will. Is this an appropriate amount of Will? Am I audible? Okay, all right, cool. Yes? Good?
Okay, technical difficulties are now solved. So, I'm talking about RPM scriptlets because they're an enormous pain in the ass, they make everything terrible, and I want them all to die.

Well — hear, hear. If we're being kind: they do stuff, and all of us who have worked with the RPM ecosystem have, in some sense, the feeling that yes, scriptlets do important things — I don't know exactly what they do, but they are really important, and if we didn't have them things wouldn't work. Which is kind of true, in that they do stuff and if you don't run them things break. But we have some problems with that. They do all sorts of things that aren't great. Everything runs as root — that's not my favorite thing about them. Every package has its own unique scriptlets, every package is its own little fiefdom, and so the scriptlets for any given package are written by somebody who might not use shell script much, might not even be a Linux programmer at all, might never even have used Linux. There's at least one package I can think of where the scriptlets are all written in Lua, because the author doesn't know any shell but could figure out Lua a lot easier, so they just wrote them in Lua — and that's fine under the guidelines and everything, get it done, I guess.

But all of this makes installs and upgrades really, really slow. My background is mostly on the installer team and doing system upgrades, and system upgrades are real slow — slower than you'd think when really you're just making a bunch of small changes to every package on your system. Part of the reason for that is that we don't know what scriptlets do: it's a black box, we run some shell script, who knows what it does — magic. (There's no magic. Or there is, but it's dark magic.) Because we don't know what's going to happen in there, and we don't know whether or not the next package is going to need something from it, every time a scriptlet runs — and there are, what, eleven different ways that they run? Pre-transaction, pre-install, post-install, pre-uninstall, post-uninstall, trigger-in, trigger-un... I forget; I used to know them all off the top of my head, there are something like fourteen different places that scriptlets run — so they get run before and after pretty much every package, and we have to run fsync and wait for everything to get written to the disk, and then we install the files, and then we do an fsync again, and then we run the next scriptlet, and then another fsync, and then we start the next package. It involves forking and execing and all that. It makes everything painfully slow compared to how fast it could be.
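For anyone who hasn't poked inside a spec file: a scriptlet is just an arbitrary script embedded in the package metadata, run as root at install, upgrade or erase time. A minimal, made-up example of the kind of thing being talked about — the contents are illustrative, not taken from any real package:

  %post
  # runs as root right after this package's files are installed
  /sbin/ldconfig
  systemctl daemon-reload >/dev/null 2>&1 || :

  %postun
  # runs as root after the package is removed
  /sbin/ldconfig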
The other thing is that scriptlets are under-designed. One of the things that happens a lot during system upgrades — and I always feel bad about this, because I've written like three upgrade tools at this point — is that at some point somebody will be watching their upgrade very intently, very excitedly, the whole progress meter is going doodly-doodly-doo, and then it stops, and they're like, "oh no, why did it stop?" And they wait like 30 seconds, and it's still sitting there, and nothing is happening, and they're like, "oh no, the upgrade is broken" — or something — and they pull the plug. Usually what's actually happening is that it's relabeling the SELinux policy on the system or something like that, but because there's no progress reporting from scriptlets — because there can't be, because we never really figured out how to do that — there's no way of knowing that that's what's happening; your system just kind of sits there. This is why DNF now says "running scriptlets": that's something I made them add; we argued about it for years. Yeah, I'm really happy that it's there. But yeah, there's no progress reporting, and there's no way of really adding it to the spec — I mean, we could, but we haven't figured out how to do it, and it would break all backwards compatibility. So we don't know what's happening in there, and you think it's stuck, and you pull the plug, and you've completely destroyed your system. Sorry — you should have gone for a walk, I don't know.

The bottom one is the thing that bothers me at an existential level: we do not know what's happening in there. So, for example — what does that do? That turns out to be a way to just create a file: a clever way of creating a file and, at the same time, making sure you set it to the right mode, rather than doing that in two operations. So if you want to atomically create a file with the correct mode — somebody, I don't remember what file or what package this is even from — you can do that. I had no idea that's what it was doing, but yeah, you can install /dev/null. Who knew? Great.

Oh, here's another fun thing. Stephen's going to have to remind me what this fixed, because it's something absurd — do you remember? So, what was happening here: this was actually cleaning up to make sure that we only had one edition of Fedora installed — your Server edition, your Workstation edition, or your non-edition Fedora — and it was designed to make sure that we only had the correct presets for that edition. We had reserved the 80- range of preset files for that on the file system, and this was designed so that if we were changing or assigning the edition, we would remove any others. However, this is Lua, and the change here is adding a percent sign. Without that percent sign the pattern effectively matched anything that started with an 8, then a range of characters that covered all the alphanumerics, and then ended in ".preset" — so it deleted everything that started with an 8, which included all of the reserved presets, whereas only the 80- range was reserved for the editions.
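This isn't the actual fedora-release code, but as an illustration of how easy that class of mistake is: several characters are magic in Lua patterns unless escaped with a percent sign, so a pattern meant to match only files named like "80-something.preset" can quietly match far more:

  -- Illustrative only: '-' is a magic character in Lua patterns,
  -- so "80-" does not mean a literal "80-" unless the hyphen is escaped as "%-".
  print(("85-display-manager.preset"):match("80-.*%.preset"))   -- matches: far too much
  print(("85-display-manager.preset"):match("80%-.*%.preset"))  -- nil: literal "80-" required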
And what did this cause? This one-character change caused a catastrophe. We basically had upgrades where none of the services that should have been running on the system were running on the system, right? This one line in something made it so that people would install systems and then suddenly nothing worked at all, and tracing it back to that was a really shitty week for somebody. Condolences, friend.

Oh, here it is — here's the big list. So these are all the places scriptlets run, and remember, if you're doing a system upgrade there are 1500 packages, so you get to do all of these steps 1500 times — well, no, the pre-trans and post-trans ones only happen once, but the middle parts happen 1500 times. It's slow, and bad, and the question is: do we actually need to do all of that?

So, okay, by way of a demo — gosh, is this actually going to work — we sort of put together a little thing. Yes, sure, I'll let it run. So we put together a thing, and that is not a spectacular demo, but you see some text happening, and now something is booting. That six seconds right there was us constructing an entire file system image from a kickstart and then booting it in KVM, and it works fine — in six seconds. Usually it takes — six to ten minutes was what we were looking at before. And the way we did that was, essentially, to just not do scriptlets. That's about it. I mean, there's some other stuff — we skip the decompression of all the package headers and things like that — but we just take all the package contents, put them into the file system, and do a little bit of tweaking at the end to make sure it actually works. I wonder if that's going to keep running. It doesn't. Smart.

All right, so here's the thing: the way we did this was by combining these two. The way that magic works is that I went through and read every single scriptlet in every single package in RHEL 7 — wait, thank you — and what I found out was that they're all kind of weird and clumsy and strange, but they only do these six things. This is everything that happens in every scriptlet in all of RHEL. It's just that. Which means we really don't need to be allowing the system to run arbitrary code 1500, 1600 times during an upgrade; we just need some stuff that does these things — which is, I know, big — but we already have a solution for a lot of them, and we could have basically equivalently powerful stuff that does what you need to do and is introspectable. We can look at your package and say, "oh, this one's going to create a user," and if two packages want to create the same user we can skip one — we can do it once. We could wait till the end, because we know that nobody's going to actually need that user if we're not building a live system — say, if we're building some sort of container, we don't need to create the user in the middle, as long as we're not running scriptlets that assume the user will be there, which they don't need to, because why would they? But this is basically everything that happens in the scriptlets, and we have stuff for a lot of this.

For users and groups we have sysusers.d — a thing that comes with systemd that just makes users for you. You can just drop a file into place; you don't need to run random code. Drop a file into place and your user will get created when it is appropriate to do so.
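A sysusers.d snippet of the sort being described is a one-line declarative file; this one uses a made-up service name purely as a sketch:

  # /usr/lib/sysusers.d/myservice.conf  (hypothetical)
  # Type  Name       ID  GECOS               Home directory
  u       myservice  -   "My service user"   /var/lib/myservice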
Now, there's also systemctl. People like to turn services on and off in their packages — don't do that, man. What if we're trying to build, I don't know, a container image or a virtual machine image for somebody else? Don't turn my services on; don't go flipping my light switches. This is actually forbidden in Fedora now, but it happens a lot in RHEL. Luckily that one's going away — these are all things that are mercifully going away — but I'm going to need help, and discussion, on how to get rid of some of them.

Then there are things like creating empty or default files, moving default configs into place, that kind of thing. Again, systemd handles this for us: use tmpfiles.d. It handles everything except, I think, one case, and I can't remember what it was — oh, that's right, there was one part of sysusers.d where you previously couldn't give a user a different shell, and in response to a previous version of this talk they've actually fixed that. So hooray — people actually want to fix these things. Join us, won't you? tmpfiles.d snippets cover every case I've ever seen: if you're installing a package and you're like, "oh, I need to create a file, I need to set up some default stuff" — everything I've ever seen can be handled with a tmpfiles snippet, and if you have a use case that wouldn't work, please come talk to me and we'll try to figure out a better way.

System-specific data — this is something that Stephen did — if you're generating keys or certificates or a machine ID, something specific to the hardware you're running on, or the system you're running on if it's not bare hardware, there's a specification in the Fedora project for how you do that, how you handle initial service setup. So don't go handling that in package scriptlets; it's not necessary. System configuration: you just don't mess with this stuff. We have presets for some of these things — well, that's more for turning services on and off — but you shouldn't be twiddling the firewall, and you shouldn't be inserting kernel modules when your package gets installed. I know you have a cool kernel module and you really think it should be installed if applicable, but you just don't; there are other ways of handling that, I'm sure, and generally it's not applicable when we're building images.

One of the big things about the product I'm working on — well, there's Composer, and then there's Weldr; Weldr is the upstream part, and that was what was doing the six-second image build — is that we're building images from the outside in. Normally when you're building a system, we install a bunch of packages into it, and — I don't know if you were here for my earlier talk — we basically open up your hard drive, and it's not like we're laying bricks down to make a wall; it's like we've got all these little robots with chainsaws and arms and whatnot attached, we throw them into an arena, they battle, they clamp on to each other, and eventually they construct Voltron. It's super cool that it works, but it's insanely complicated for what it is, when you really just want to lay down all the files that are in these packages and then do the necessary tweaks at the end. The whole point here is to figure out which tweaks are necessary and do them when necessary — and this stuff is not necessary, especially if you're building a system from the outside, because you're going to be doing it on the outside system, and I don't need your super-cool kernel module installed on my laptop.
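And a tmpfiles.d snippet, of the kind mentioned a moment ago for default files and directories, is equally declarative; the paths and names here are made up:

  # /usr/lib/tmpfiles.d/myservice.conf  (hypothetical)
  # Type  Path                 Mode  User       Group      Age  Argument
  d       /var/lib/myservice   0750  myservice  myservice  -    -
  C       /etc/myservice.conf  0644  root       root       -    /usr/share/myservice/default.conf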
Then there are things like caches and catalogs — this is like running ldconfig after every library gets installed, or running update-desktop-database or the icon-cache update and all those update-whatever things people need to run. Those should be handled with file triggers. We've been transitioning stuff — apparently glibc has finally... yeah, it took a while, but that's happening, so you no longer have to call ldconfig from your own package — and we're getting rid of that stuff on a case-by-case basis, when we can figure it out, when the people who own this stuff are willing to do it, because this is all, you know, Fedora, it's volunteer work. But it's a good idea. If anything, I hope I can convince you all that I'm not just up here saying "destroy all software"; I'm saying let's try to do stuff smarter, and one of the things we could be doing is putting control of tricky things — like when you actually have to run this update script — in the hands of the person who runs the main package, the one that owns the tool that does the update. You drop your files down and you just walk away; that's how it should be. If you maintain any packages that contain that sort of tooling, please come talk to me and we'll figure out some way of making sure your stuff gets handled automatically.

So that's really my pitch for why we should get rid of scriptlets, and that's about it. Do we have any questions — and do we have somebody with a microphone? It's okay, it's a nice one.

Simple question: have we solved the problem yet where you want a file in a package to be owned by a user that is created by that package? Because I recall that was a problem before. Oh yeah — if memory serves, the RPM guys have been talking about finally adding user and group support, because packages don't really work without it: packages need users to install correctly, but RPMs don't have user names. They expect "I'm going to create this user" or "I'm going to make all these files owned by this user," and if that user doesn't exist, it breaks. So obviously, if the RPM depends on that being there, RPM should be handling it, but fifteen years ago it was... so we're trying to get that in. At present RPM itself does not handle this; however, one of the bits we got from the systemd folks — talking to them about sysusers.d — is that they have now provided, upstream, an RPM macro that can be used from %pre, essentially, to do this for us. Oh yeah, that's right — so we have what is effectively a workaround, but it's a nice handy macro, and once we actually figure out how to get rid of scriptlets proper, the macro — or what backs the macro — can just be replaced.

Okay, so could I use this workaround mechanism in a Fedora spec, and would it be guideline-compliant? As of now, no, and that is specifically my fault: I've been promising to write that spec change for the FPC to approve for four months, and I have been dragging my heels on it. I owe you that spec. Okay, so that's not fixed, but it will be soon.
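The exact macro name has moved around a bit over time; purely as a sketch of the shape of the thing being described — assuming the %sysusers_create_compat macro and %_sysusersdir directory that systemd's RPM macros provide these days, and a hypothetical myservice.sysusers source file:

  Source1: myservice.sysusers

  %install
  install -Dpm 0644 %{SOURCE1} %{buildroot}%{_sysusersdir}/myservice.conf

  %pre
  %sysusers_create_compat %{SOURCE1}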
I have a question about the same topic. In my medium-sized Linux environment, which has some legacy baggage, many groups — like postfix — exist in the directory service with the same GID and the same name spelling, and when a scriptlet runs that says "add a group postfix," it fails. Would this handle the case of the name service switch already providing the same group with the same GID?

It's supposed to. I don't remember what their solution was going to be; my solution is the scorched-earth one — nobody gets an assigned UID or GID, that's silly, they all get assigned dynamically at startup, and there's support for that in systemd now, where if you need a user it can just be created at the time your service starts. But that's new-school. My understanding is — I don't know what the current macros look like, but the currently approved snippet you're supposed to use is always "if this user doesn't exist, create it with this UID and GID, otherwise leave it alone." We're not sure if that check works for non-local groups... I can answer that, because I wrote the fix for that specific problem, so I know it is actually fixed in Fedora. It used to be that it would just check /etc/passwd; the proper approach now is that we call getpwuid or getpwnam, whichever is appropriate, and figure out whether it's already available to anything within nsswitch. That's awesome, thank you very much.

There's a — whoa, oh golly — all right, a question on an item you mentioned towards the end: you said instead of updating caches and catalogs, use file triggers. Would that be something like a systemd path unit, or something like that? There's a capability in RPM — I think it's just called %filetrigger — where the upstream package that owns the thing that actually maintains the cache or catalog watches a path. So it works conceptually the same way; I think the systemd units are actually a little more flexible — the RPM file triggers can only work on a directory, a path, or a glob — but it should be roughly equivalent: basically, when RPM sees that a file has appeared there, it'll run a script. So you, as the package with something that needs to be injected into that cache or catalog, shouldn't have to worry about it. And if you are currently worrying about it, find the person who owns the tool you're using and tell them to fix their stuff — or come talk to me and I'll tell them to fix their stuff. That shouldn't have sounded like a threat.
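A sketch of what that looks like in the spec of the package that owns the tool — glibc in the ldconfig case — rather than in every package that ships a library; the directories here are illustrative, and the matched file names arrive on the trigger script's standard input:

  %transfiletriggerin -- /usr/lib64 /usr/lib
  # re-run the cache tool once, after the transaction, for any package
  # that dropped files under the watched directories
  /sbin/ldconfig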
So, one of the packages I maintain — for my work, as a third-party package in various ecosystems — is a tool that needs network communication open. It's a listening daemon, so it needs a firewall port opened, and you said earlier in the talk to just not do that. But in the case where it's running, how would I do it? And in the case where it's being preinstalled into an image — I don't really know how to deal with that case all that well. I think Stephen might have something to say about that.

Yeah, I can take some of that. In the case of the firewall: Fedora's general answer to the firewall — which, as of two years ago now, I think, can handle it — has the ability to just drop a file and tell it that this should be enabled. It won't take effect immediately, but it'll be on at next boot, so that is the preferred approach for a service that should do that. That said, this is also forbidden in packages that are approved for Fedora, because the working groups and FESCo make the decisions about what is allowed to be opened by default; but for a third-party package, there is that firewalld upstream feature that allows you to set it. I just want to do nice things even if I'm not in a Fedora package — and thank you for that. Does that answer your question? Yeah, thank you.

Yeah, I think the general lesson there is that activating changes to the system immediately is a policy decision that isn't necessarily up to your package, but we have pushed it onto packages historically, and we should stop doing that. One would hope that all you would do as a package is drop files into place, and then things would happen; so every time that you, as a package, have to go crank some knob or flip some switch to make the right thing happen, it's probably a good idea to figure out who owns that knob or switch and talk to them about why you had to turn it manually — because that's a crappy experience for you, a waste of your time, and you have to maintain that script forever, when it should just be part of the system. So that's a general "hey, if this is a thing you encounter frequently, talk to your upstreams, or talk to me and I'll talk to your upstreams." We can fix all of this; it's all open source; we have the power.

Any other questions? One more — oh yeah, you can just holler and I'll repeat the question. This is a theoretical question, but as you're rethinking packaging and all the installation stuff, what are your thoughts around things like pip and gems and all the other sorts of things out there? Well, that's the thing, right? It is the year of our Lord 2018, and every programming language comes with some packaging system, so why do we insist that everybody use our packaging system, which is way harder and gnarlier to use? This is a much bigger problem than can be solved in RPM — especially the way we use RPM today, where we end up having to wrap every other packaging system, which is why we have so many gnarly things in our dependency database now. One of the most common words, if you split apart every package name, is "github," because every Go module is hosted on GitHub, so the word "github" shows up like fifteen million times. It's ridiculous. That's a weird abuse of a system that was never designed for that sort of thing, and we need to work on designing a system that handles existing packaging tools without that sort of impedance mismatch. I have general, big, hand-wavy ideas about how we should handle that, but I don't have anything to tell you right now about what we should do today. My big hand-wavy talk was earlier in the day; buy me a beer and I'll tell you all about it — but yes, I probably will anyway. If you find me yelling about scriptlets at a tree later in the evening, remind me where my hotel is and you can send me home. No, I think that in the longer term we as a community will need to accept that people do their own packaging, and we need a system that works with that instead of against it, and what that looks like is a really interesting conversation — but it's not one for this moment, not because I don't want to have it, but because it's scriptlets time. So, yeah — anything else? This is part of my whole "practically, how do we get from the world we have right now to the glorious future I someday envision" — I'd like us all to dance in the fields of wonder, but right now we've got to dig out from the huge pile of crap that is scriptlets.

So anyway — yeah. So, during the center west I gave an unplanned lightning talk about how I sped up the provisioning of scientific workstations — on RHEL 7, with about 4000 binary RPMs — from six hours to one hour. One of the changes, switching the install tooling over to dnf, is outside the scope of this conversation; it's a limitation of the integration between the two.
But the other change was at provision time: before people log in, nobody is using the machine for running applications, so use nosync. nosync is a small — like a hundred lines of code — library that suppresses fsync and similar function calls, and the idea is that at the end of a provision you reboot anyway, which forces a sync. There's a Debian-packaged version of the same idea, whereas nosync is the one packaged for the Red Hat family. That is clever, unsurprising, and completely filthy. Yeah. I mean, I'm glad to hear that you're doing these post-transaction tricks to eliminate the number of syncs — but has anybody... some people are trying to integrate nosync with Anaconda, because we reboot at the end anyway, which does a sync at the end anyway. Has anybody ever considered just, "I want to do a quick provision, only run fsync at the end of the entire transaction"? I think if you tried to get them to add a flag for that there'd be anger — obviously disabled by default, but, you know. Yeah, it's a tough question, because those syncs don't help anything — or rather, the problem is we can't prove they aren't necessary, and that's the whole problem with scriptlets: they're a black box, there might be a case where a sync is necessary, and we've always committed to it working that way, so we can't safely turn it off. We could let you turn it off — there might be a switch to do that somewhere — but we can never say "you should use that, it'll make your stuff faster," because as soon as somebody's house catches fire, we end up paying the bill. Oh yeah, I'm sure people would adopt it in existing systems, but for provisioning time it's useful, because if you fail in the middle of provisioning you just re-provision. My goal for provisioning, though, is for you to be just laying down bits and not dealing with RPM in that process at all: you have the payloads you want to lay down, ready to go. You can do that with the installer as well, but that's more of a long-term goal. Yeah — no, that's a clever and awful hack and I commend you. It's worked over 200 times reliably, though. Oh yeah, for sure — controlled conditions. All right, so, anything else? No? All right — thank you for your time and attention. [applause]

[Speaker change; A/V setup for the next talk.] For those of you in the back, frankly, I suggest you move slightly forward: there will be a number of graphical elements on screen, so if you want to see them you'll probably be better off close up. No jumpscares, though — nothing scary. You are good to start.

Okay, so I'm Christophe Ducan — welcome to this talk, which is about slicing GPUs, virtually: how to use GPUs in virtual machines. As you probably know, GPUs are used to accelerate graphics rendering — for instance, to play video games; for design and modeling software; to play video games, that's important too; for data visualization — what you see there is an example where we show, in real time, about one billion data points using this visualization software; and for playing video games, in case you forgot. They are also quite good for any kind of parallelizable computation — for instance machine learning and artificial intelligence: you have here the training sets that NVIDIA uses to train an AI for driving; human genome analysis; high-quality 3D rendering, when you don't need it in real time. Now, for those of you who are interested in GPUs: at SIGGRAPH 2018, about three days ago, NVIDIA announced a new generation
called Turing, and for the first time they claimed real-time ray tracing. So why does that matter? Because it's actually not that new. Our friend Ulrich Drepper, at DevConf, was talking about GPUs being something like 120,000 threads that you can use at the same time — so how can you use that much compute power? Well, thanks to this fellow here, like that. Here is a very simple example: this is real-time ray tracing, and at the bottom there you can see all the pictures if you want to check that this is real time. This is the actual time — it's a clock, you can check against your own watch that it's the right time. Now, this is pretty naive, because that was a few years ago and you can do much better now. For instance, you can do this: the same idea, a real-time clock — you may recognize a Dalí work, the melting clock, but now it's showing the real time. And you can do a number of different effects: characters, landscapes. This is a realistic-looking rock — except, you will see, this is all computed. This one is taxing my GPU a little bit when running full HD. You can create moods; all of this is done in real time, this is not a movie. You can create landscapes that have physically nonsensical features.

Okay, so how does it actually work inside? At the first level, at the top, the application calls an API — there are a number of graphics APIs, like OpenGL, Vulkan, etc. — and that goes through a GPU driver, which sends it, in some proprietary format that I call the graphics bitstream, to the chip: the chipset driver in Linux, in the kernel, and the graphics card renders that into a framebuffer, and the framebuffer is converted to a digital signal that goes to your screen. The acceleration that I told you about — for instance for artificial intelligence — follows a similar path, except of course there are different APIs and you don't send the output to a screen.

Now, why would you want to virtualize GPUs? Well, one reason is compatibility — for software development and testing, or if you want to run a guest OS, a gaming operating system like Windows, for instance. For flexibility: this can allow video streaming to thin clients, it enables cloud gaming, and of course all the benefits you get from scalability, management and so on, you can get with graphical devices as well. In terms of large-scale deployment, this is the Titan supercomputer, built around NVIDIA GPUs — most of the compute power in this kind of machine now is in the GPUs — and of course you can have swarms of GPU-accelerated nodes.

Okay, so the problem is that when you actually want to virtualize this, there are many, many solutions, and it's a bit complex. The simplest and most naive one is full device emulation of a VGA-class device. This is really how we did it in the 2000s or so, and the good thing about it is that it's very compatible — for instance with old software and old hardware — and it works at guest boot, so you can emulate it before the firmware is even loaded. It supports practically all the virtualization features: you can do migration, things like that; you can have as many concurrent virtual machines as you want; and you can have remote access to it — I'll get to that at the end of this talk. Now, the cons of this approach: it's very slow, because you are emulating a device that was not designed for virtualization; it has tons of legacy quirks, like its memory handling; and it doesn't do 3D — VGA doesn't have 3D — so if you have 3D it's all software, no compositing, so no modern desktop. And that's basically the reason why you want to move to something else.
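A quick way to tell, from inside a guest, whether you're in that all-software situation or getting real acceleration is to ask what virtual GPU the guest sees and who is doing the rendering — a small sketch, assuming pciutils and the glxinfo tool from mesa-utils are installed in the guest:

  lspci -nn | grep -iE 'vga|3d'
  glxinfo -B | grep -E 'renderer|OpenGL version'
  # 'llvmpipe' in the renderer string means pure software rendering;
  # 'virgl' or a vendor name means some form of acceleration is in use.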
The next step up is basically to expose a virtualized kind of interface to the guest. Your graphics API now sends some specific commands that are converted by the host driver: you have the guest channel in there, and a virtual driver that talks to the actual driver. It's slightly more complicated — it's very simplified in this picture — but basically you're now sending a stream of virtual graphics commands, and the rest, the bottom of the stack, is the same as before. The pros: you can accelerate 2D, you can get some 3D, and it's flexible; basically what you don't get is specific card features. The cons: you can't expose the vendor features from the card, because what you're exposing is a virtual device that doesn't have the actual card's features, and that limits the virtualization features as well. The performance is at best medium — you don't get direct buffer access to the card, things like that — so you can get multiple VMs running at the same time, but you don't get the best performance.

Now, when you run multiple virtual machines at the same time, there's another difficult problem: you have to schedule work on a device, the GPU, which is not necessarily designed for that — context switching on a GPU is problematic. Resource allocation is also an issue, because you no longer have a dedicated card for one guest workload; you have to share it. The GPU capabilities are hard to expose, as I mentioned. Migration is tentative at the moment, part of the problem being how you feed back the screen output: if you switch from one host to the next, the graphical output switches from one machine to the other, so you lose some of the benefits. And remote access is a problem in this case — where do you put the remote access control?

So the next step, in order to get better performance, is GPU device assignment. In that case you pass the graphics API straight through to a vendor driver that resides in the guest, and it talks directly to the hardware through vendor-specific commands. So now you're poking a hole through the virtualization layer. The safety issues of that hole can be somewhat mitigated, but it depends on how you do it. The pros of this approach: you get near-native performance in the best case, since you're talking to the card directly; you have good compatibility with new features, because now you can see the card, you know what features it exposes, and you can therefore switch to a more modern rendering API — you're using the vendor driver, in effect, so you get the latest APIs and the best graphics features. The cons: the boot console is problematic in this case, because you're no longer connected to the emulated display hardware, so basically you get a black screen; the setup is not very flexible, because the device is attached to a specific VM; migration is a real problem in that scenario; and there's no sharing either — I'm going to explain in a minute why not.
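For the device-assignment case, this is roughly what handing a whole GPU to a KVM guest looks like with libvirt — the PCI address below is made up, and this is a sketch of the mechanism rather than a tuning guide:

  <!-- libvirt domain XML fragment: give the guest the host PCI device at 01:00.0,
       bound to vfio-pci on the host -->
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
  </hostdev>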
So, to see the difference between these two cases I have to switch back and forth, because the difference is not completely obvious, but the main difference is how we talk to the hardware at the lowest level. There, we're going to find a way to share the same GPU across multiple VMs. In the vGPU case you basically split one GPU between multiple VMs, and now that's a GPU that knows how to be split, that knows how to support virtualization. So we get some compatibility, the same benefits as before, but you get multiple VMs — and that's the big benefit here. The cons: you still don't have much flexibility; the hardware requirements are high, because you need a lot of extra GPU power since you're going to split it across guests, and in general the split is static; and migration and features like that are still problems.

Now, it turns out that when you do this kind of device assignment — it might seem very simple, just take a GPU, assign it to a VM, what could go wrong — it's not that easy. You need hardware support, not just in the GPU itself but in the chipset as well, because you need some kind of isolation between the GPU and the memory it can touch; you need to be able to enable the IOMMU, for instance; and then there are a number of hardware quirks you need to deal with. In terms of topology, things are not completely isolated the way they would be when you virtualize a CPU — this is, in a sense, relatively similar to what happens with NUMA when you run virtual machines across CPUs. Your devices might have some arbitrary topology in the physical world, and you create your IOMMU groups and split them accordingly. As long as you have a single GPU it looks fine — it's basically a straight line, so no real issue there. If you have two GPUs connected to the same PCIe bridge, they can talk to one another: it's all within a given peer-to-peer domain, so two GPUs can enhance one another, talk to one another, and so on. But if you have to go through inter-processor communication, with QPI for instance, then it becomes much slower. So the way you lay out your GPUs is not completely transparent, and you have to deal with that, which means that from a management point of view device assignment is not as simple as, say, allocating memory. Essentially, at the moment, it's really host pinning, and that means it's not really applicable to a cloud style of solution — and I'm talking about the case where we're not even sharing.

In order to mitigate that, we want to use the mdev framework. In that case Linux gives you an interface that exposes these mediated devices — mdev — which lets you get the kind of acceleration you need in virtual machines, so that guests can still talk directly to the memory of their slice of the device but can't write to some other location. Now, mdev is a really complicated topic; I can't cover it today, but I invite you to look up this presentation if you want to see exactly what you get out of it. The problem is that it's an interface that is not very easy to manage at the moment: there are things like type descriptions, number of instances, and so on, and all of this is at the moment completely vendor-specific, so you have to have your management software learn the details of what's inside. And there are a number of things that are not exposed — device quirks, driver limitations, whether it can support multiple instances for a single VM — and all the stuff related to licensing is hard to manage that way as well.
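The raw interface being described here is sysfs; as a sketch — the parent device address and the type name below are made up, and real type names are vendor-specific, which is exactly the management problem being pointed out:

  # List the mediated-device types this physical GPU offers, then create one instance:
  ls /sys/class/mdev_bus/0000:01:00.0/mdev_supported_types/
  cat /sys/class/mdev_bus/0000:01:00.0/mdev_supported_types/vendor-type-1/available_instances
  echo $(uuidgen) > /sys/class/mdev_bus/0000:01:00.0/mdev_supported_types/vendor-type-1/create
  # The new device shows up under /sys/bus/mdev/devices/<uuid> and can then be handed to a VM.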
So, I've shown a number of different ways to split your GPU, and if we try to compare them, we see that a native GPU gives you the best software availability and performance, for instance; then you switch to an emulated GPU and performance takes a big hit, scalability takes a big hit, but you gain in security and flexibility. And if I keep going like this and try to compare all the various cases, I see that there is no one big winner — it's not that one of them is the best, it's really a series of trade-offs. This is why it's still very much work in progress for everybody. For instance, in terms of features, these are relatively recent announcements for live migration of NVIDIA vGPUs — sorry, the one before that; this one is a demo. I was telling you about the problem with migration, where you see that it switches from one monitor to the next: if you have to run across a data center to see the output of your machine, that's not very convenient. Another thing that shows this is still very much work in progress is that all the resource allocation and sharing is still very much done per vendor and not completely settled, which means that from a management point of view there is no single way to do it. To take a Facebook analogy: it's complicated.

Now, let's add remote access on top of this. SPICE is a Red Hat solution for remote access, and it tries to address some of the issues we're talking about here, notably things like integration. Most virtual machines are used remotely, so when you connect to them in a data center you're basically going to get a stream of video updates. This enables a number of use cases, like cloud gaming: you can stream high-quality 3D to your phone, for instance, because H.264 and streams like it are completely asymmetric — much easier to decode than to encode — so you can have a big heavy machine doing the encoding in the cloud and a lightweight machine on the other side for visualization. That means you can bring your work environment to any device; it has many nice features. For KVM users, the solution to get this kind of thing is SPICE, and we're working on making SPICE streaming-capable like this.

To explain the evolution compared to what exists today: there are multiple ways to do remote access — from the guest or from the host. Historical solutions would send 2D commands, basically: they intercept drawing commands somewhere, send them over the wire, and render them on the other side. You can intercept them in the guest — that would typically be Microsoft Remote Desktop or those kinds of solutions — or you can have, in the case of SPICE, the SPICE server do it for you, in which case you also get the remote console for your VMs. For video streaming it's more or less the same idea: you can do it from within the guest or from within the host, and I'm going to show the trade-offs in a minute. What happens when we do this is that, instead of sending graphics commands in a device-specific format like before, we switch to a network-oriented kind of graphics. If we do that from the guest — for instance if you were using Windows Terminal Services — you have, in the guest, some kind of software rendering stack, and it sends the data over the network; you're basically using the virtual network the normal way.
Now, when you do this, it's widely available — most operating systems have it; it may be as simple as X11 in the case of Linux — and it's transparent for most users. But of course it only works after the guest has booted, so you have no console in that case, and it requires that the guest network works. Why is that important? Because in some cases you may want a guest that you can access remotely, graphically, to do something, but whose network does not go outside — you want the network to stay constrained inside the VM. That's why we also have host-based remote access: in the case of SPICE, we have a SPICE server component that does the encoding. The way this works is that we have a driver in the VM — in KVM, basically — that intercepts the 2D commands and behaves mostly like a VGA device, then transmits them to the SPICE server; the SPICE server sends them over the network, and a SPICE client on the other side renders them. That works pretty well for 2D, but it doesn't work very well for 3D.

If we want to do 3D, we do need some kind of streaming — something that looks more like an encoder generating video on the fly. If we want to do that with vendors like NVIDIA, they really insist on doing it the same way it's done for virtual GPUs, and remember that in the vGPU case the GPU driver resides in the guest — which is why you need to do the encoding there as well. That means you're encoding H.264 from your framebuffer within the guest, which means you need, in the guest, some kind of agent that sends the data over the network to the other side. This is a little complicated to set up, and it has the same kind of drawbacks as before: you have to wait until the guest has booted, and you have to have an alternate solution that shows you the console pre-boot.

Finally — something that is not done yet, but that I hope will happen within maybe the coming year or so — is host-side streaming, still using hardware acceleration. In that case we send the graphics down there, we still have the splitting abilities we talked about before with vGPUs, we're using the vendor GPU driver here, and we do the encoding host-side, meaning that the whole network stack happens on the host. The nice problem you have in this case is that you need to find a way to share the encoder, share the framebuffers, and so on. Why is this a problem? Because your graphics API is doing the rendering on that side — basically all the rendering is done in the guest — and you need to find a way to pass the guest data, preferably without copying it, in a way that the host driver can see. There are some developments going on in the Linux kernel at the moment to try to enable that, but it's a relatively complicated problem: the rendering may happen at 60 or 120 frames per second, so you need to be able to pass the data from the guest — ideally from guest user space — down to the host kernel without copying it, because otherwise it's very expensive. And — I spoke too fast, so I'm out of time; I actually removed a number of slides compared to the original.
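For reference, the host-based 2D path described above is roughly what you get from a stock libvirt/KVM guest configured for SPICE — a minimal sketch of the relevant domain XML, with the sizes and listen address chosen arbitrarily:

  <graphics type='spice' autoport='yes' listen='127.0.0.1'/>
  <video>
    <model type='qxl' vram='65536'/>
  </video>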
Yes, thanks for the presentation. I wanted to ask: in the Kubernetes community, whenever you request a GPU you get the whole GPU, and sometimes you might not utilize the GPU fully, so the work that you're doing here with splitting GPUs is very interesting. This vGPU, if I wanted to try it out, where would I find it?
Okay, so some of the stuff that I showed is actually working today with... well, okay, when you say you want to try it out, under which kind of conditions do you want to try it out?
Like, I want to download it and install it on a 3-node or 5-node Linux cluster. This is for experimental purposes, not production kind of use.
Right, yeah, this is probably going to be experimental.
But part of utilizing GPUs in Kubernetes is that limitation of not being able to split resources: you can request how much CPU you want, but you apparently can't do that with GPUs.
So at the moment this is still very much work in progress. Oh yeah, okay, so NVIDIA vGPU is supported in RHEL 7.5, so if you want to play with it I think you could with CentOS, but you have to buy the GRID software from NVIDIA.
Yeah, but as I understand it, today it's supported in 1:1 configurations, so one to one, which means it can't split, I think.
No, it supports vGPU, just not with SPICE yet, that's all. Okay. Yep, so it's supported with RHEL 7.5 and with RHV 4.2, but I don't think you want to set up RHV or oVirt, so you just have to get the GRID 6 software from NVIDIA.
So is there an open source alternative, like OpenCL, right now?
No. At the moment, as far as I understand, none of the vGPU capabilities that NVIDIA offers are supported by Nouveau, for instance, if that's what you're asking. For OpenCL, okay, I would need to check, because my understanding and Karen's are different: my understanding was that at the moment NVIDIA was still offering only 1:1, basically whole-card configurations, and that the rest was still unsupported configurations. Karen says otherwise, so I think I'm probably wrong on this, and maybe that's only for remote viewing. Now, for compute only, in full open source, I don't think there is anything that works if you want it fully open source. Intel is also working on vGPU, KVMGT, so that would be fully open source. There's another question in the front.
I understand why the streaming agent approach is not ideal, but are there any open source streaming agents I can use right now with oVirt/RHV, or with KVM and libvirt?
So the streaming agent itself is open source, and it's basically built so that there are plugins, which we are trying to open source because they use APIs that are normally public, so there is no reason not to, but it's still under discussion whether we can actually open source them or not. Now, the fully open source one does not use hardware acceleration, that's the problem: it basically does MJPEG encoding and things like that, which is software only, so it's good for testing purposes, but it won't give you the hardware acceleration. It will give you hardware-accelerated rendering, but not encoding.
I know that SPICE does MJPEG; is that part of the SPICE guest agents and all?
Yes, that's what I'm talking about, it's built into the SPICE guest agent right now. And there are others; you can of course use other solutions like VNC or like TGX or whatever. If you use VNC, for instance, you will have very fast rendering on the card, but you will get very slow transport over the network.
Okay, and there are lots of proprietary streaming agents, though, I guess?
Yes, as I'm sure; you have TGX as an example. Any other questions?
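To illustrate the limitation the questioner is describing, here is a sketch that is not from the talk: it assumes the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource, and the image name is only an example. Unlike CPU, which can be requested fractionally, GPUs can only be requested as whole units.

```sh
# Sketch of the whole-GPU limitation in Kubernetes (assumes the NVIDIA
# device plugin exposes the "nvidia.com/gpu" extended resource).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        cpu: "500m"          # CPU can be requested fractionally
        nvidia.com/gpu: 1    # GPUs only in whole units; 0.5 is not accepted
EOF
```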
Thank you. So, assuming there are three hardware vendors, you know, NVIDIA, AMD and Intel: do you have to buy professional graphics adapters for virtual GPU? Including Intel, do you have to have, like, a Xeon and all?
Okay, so as I mentioned in one of the earlier slides, in order for this to work you really need good support for the IOMMU from the card side, and it really depends on the vendor; each vendor has a different approach. For NVIDIA, you typically want recent cards to get vGPU; you really need cards that are dedicated for this. For historical reasons, AMD used to have the IOMMU support baked in earlier. It doesn't necessarily mean it's better supported in Linux, but at least from a hardware point of view, older cards have better chances of being isolated; it doesn't mean the driver support is good. And for Intel, Intel is working in a very open source way, so their approach is nice from a software support point of view; the problem is that the performance at the moment is not the best among the vendors.
Yes, okay, but are those hardware features only in, like, the Quadros rather than the GeForces, and only in the Radeon Pros rather than the Radeons?
As Karen said, it's worse than that. In the case of NVIDIA it's a specific software license, the GRID software, and basically in order to activate it you have to talk to a GRID license server; you have your license for this or that configuration. So it's not just the hardware, it's also a software license. And there is also a specific GRID SDK that lets you take advantage of some of the features. I was mentioning that we're using public APIs for the streaming agent; those are part of this GRID SDK, for instance, APIs like how to capture the frame buffer, how to stream it and encode it, things like that.
Looks like we are out of questions. Thanks a lot for your attention, and sorry for talking so fast.
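As a closing illustration of what the splitting discussed in this Q&A looks like on the host once a vendor vGPU driver is loaded, here is a minimal sketch, not from the talk: it uses the kernel's mediated device (mdev) interface, and the PCI address and profile name are purely illustrative; the profiles actually exposed depend on the card, the driver, and, for NVIDIA, the GRID license.

```sh
# Sketch of splitting a GPU with mediated devices (mdev) on the host.
# The PCI address and the "nvidia-63" profile name are hypothetical examples.
GPU=/sys/bus/pci/devices/0000:3d:00.0

# List the vGPU profiles the vendor driver exposes for this card.
ls $GPU/mdev_supported_types/

# Create one vGPU instance of a given profile; it appears as a mediated
# device that can then be assigned to a VM (for example via libvirt's
# <hostdev mode='subsystem' type='mdev'> element).
UUID=$(uuidgen)
echo $UUID > $GPU/mdev_supported_types/nvidia-63/create
```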