This talk is about container networks, well, presumably rootless, as you might have guessed, and pasta. So what is pasta? As you might have guessed, I'm Italian, but I'm not talking about that. So let me show you something. Here I am, I'm on a server, and I do this, and I'm root. Let's take this server down. Am I really root? No, I'm not root. Right, so great. Lucky. What's the trick then? Wait, did I get through? Did I just reinvent unshare? And I'm just so enthusiastic to show it to you.

So OK, let's try to understand where I am. And the good thing sometimes is to look at networks, or network interfaces. OK, so I have loopback, and I have another network interface, but it's down. So how am I here? I mean, this makes no sense, right? OK, let's get out of this madness and go back to my host, have a look at what I have here. OK, that makes more sense, right? I have an Ethernet interface that is up, looks similar. So OK, the MAC address is different, though. And it's not the same interface, because one is down and the other one is up.

So OK, I read the man page for you. I actually wrote it too, but I would not try this. And I will check addresses. I have into this strange route, no routing. Oh, yeah, now OK, great. So this is up, or at least administratively up. The state is unknown, because I didn't send a packet yet. I have some addresses, IPv4, a bunch of IPv6, link-local. Great, OK, this is making more sense. And let's see if I can reach the internet. Sorry, we talked about pasta, so yeah, great. So IPv6 is up. I see IPv6 is up. DNS resolution was working.

OK, so did I just re-implement Podman? No, because it's full of stuff here. And the container would be clean, right? I just started it. OK, so let me quickly explain the trick to you. And note that I'm really not root. I'm really, really not root. I can't delete one thing from here. Sorry, ls, I can probably do it, but let's say I want to. Right, so no, I didn't re-implement Podman.
And in fact, Podman can also use this thing. And let's have a look at this. OK, pretty similar. So the interface still has the strange name. Well, that was the interface on the server. I have the same addresses. Let me check that this works. Yeah, I can install iperf3. Great, so this seems to be some user-mode networking.

And before I finish revealing the trick to you, I just mentioned user-mode networking. And it might look like I'm copying packets from user space and back, and this is terribly slow, right? I mean, somebody probably played with Slirp. So OK, let me go back to my strange tool here. Actually, sorry, let me run a server first. So I daemonize it so that I can just keep a big terminal here. I hope you all see here, or is it even? No, it's OK. Great, so how do I reach that? Well, it looks very local. I will try with localhost. And yeah, I'm being quite arrogant. Come on, I know that it's fast. And I can give it 32 megabytes of TCP window and zero copy and disable Nagle's algorithm. Right, maybe look even. Let me do two flows. Great, 60 gigabits per second.

So, right. OK, now I can reveal the trick to you. So I'm not root. I created a network interface. And this thing is pretty fast. Let's look at some diagrams. What's the trick? So how does networking work when you are nobody? Or, well, in my case, you are sbrivio, but it doesn't make a real difference. So you don't have root. I don't have CAP_NET_ADMIN, right? I didn't show you, but trust me, I didn't cheat on that. So I cannot create interfaces. But Linux allows you, if you detach your user namespace and the network namespace at the same time, since Linux 3.8, which is about ten years ago by now, to actually create a network interface somewhere. So we can actually create a network interface because we are UID 0. So you might call it root for convenience, but that's not root. That's just UID 0 in the namespace.
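The step just described, detaching the user namespace and the network namespace together and ending up as UID 0 inside, can be sketched in Python roughly as follows. This is only an illustration under assumptions, not how pasta itself is written: the helper names id_map_line and enter_rootless_netns are made up for this example, the CLONE_* values are the ones from the Linux headers, and it needs a kernel that allows unprivileged user namespaces and a single-threaded process.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # new user namespace (value from <sched.h>)
CLONE_NEWNET = 0x40000000   # new network namespace (value from <sched.h>)


def id_map_line(inside_id: int, outside_id: int, count: int = 1) -> str:
    """Format one line for /proc/self/uid_map or /proc/self/gid_map."""
    return f"{inside_id} {outside_id} {count}\n"


def enter_rootless_netns() -> None:
    """Detach user and network namespaces at once, then map ourselves to UID 0.

    Hypothetical helper for illustration only.
    """
    # The real IDs must be read before unshare(): afterwards they show up
    # as the overflow ID until the maps are written.
    uid, gid = os.getuid(), os.getgid()

    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    # An unprivileged process has to deny setgroups() before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/uid_map", "w") as f:
        f.write(id_map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(id_map_line(0, gid))
```

After calling enter_rootless_netns(), os.getuid() reports 0 and the process sees a fresh network namespace with only a loopback interface, which is down. Outside the namespace nothing has changed: it is still the same unprivileged user.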
And UID 0 can do a lot of things like creating network interfaces. Oh, and there is the tun/tap driver. It's a kind of old implementation, but quite useful. This thing creates a network interface on one hand, and on the other hand, you have a file descriptor. It's not a socket. It's a file descriptor. So sockets can be represented by file descriptors, but it's not the same thing. And on this file descriptor, you get frames. And you can write frames. Ethernet frames, so the whole thing, you know, like it would go on a cable; we don't have one here. So it goes in here. Right.

And then we know that regular users can open TCP and UDP sockets, for sure. I mean, when you start a browser, you don't do sudo firefox, right? You just start it. So are you thinking what I'm thinking if we do this? So we have the network namespace. We just created this tap device, which gives me the Ethernet interface inside that. And down there, I have the internet. And then I know that as a regular user, I can do TCP and UDP sockets. I just need to fill in that something. Not saying that it's so simple, but it starts looking doable.

And that something, namely, needs to take the Ethernet frames, take the Ethernet header and the IP headers away, and put the payload into layer four sockets. And then I think we are done. And when we get something from the internet, we need to ask the kernel. So this something is a user space application. And we need to ask the kernel, where is this packet coming from? And then tell our network namespace. And we have a number of ways to. So we have two addresses, essentially, right?

So just a reminder for everybody. Layer one is the physical network, the physical layer. Layer two is something where you can put bytes on, the data link layer. And then you have layer three, which is IP, or can be other things, but in our case, it's IP. Layer four is transport. So TCP, UDP, ICMP, DCCP, whatever you want. And then the other layers are more related to YouTube.
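To make the "take the headers away, keep the payload" step concrete, here is a small Python sketch that handles one case only: an Ethernet frame carrying IPv4 and UDP, as it could be read from a tap file descriptor. It is a simplified illustration under assumptions, not pasta's code: the names Datagram, frame_to_udp_datagram and build_sample_frame are invented for this example, and checksums, IP options handling beyond the header length, fragmentation and IPv6 are all left out.

```python
import ipaddress
import struct
from typing import NamedTuple

ETH_HLEN = 14            # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
UDP_HLEN = 8


class Datagram(NamedTuple):
    dst: str        # destination IPv4 address, taken from the layer-3 header
    dport: int      # destination port, taken from the layer-4 header
    payload: bytes  # what would be handed to an ordinary UDP socket


def frame_to_udp_datagram(frame: bytes) -> Datagram:
    """Strip Ethernet, IPv4 and UDP headers from a frame, keep the payload."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        raise ValueError("not an IPv4 frame")
    ip = ETH_HLEN
    if frame[ip] >> 4 != 4:
        raise ValueError("bad IP version")
    ihl = (frame[ip] & 0x0F) * 4          # IPv4 header length in bytes
    if frame[ip + 9] != IPPROTO_UDP:
        raise ValueError("not UDP")
    dst = str(ipaddress.IPv4Address(frame[ip + 16:ip + 20]))
    l4 = ip + ihl
    _sport, dport, length, _csum = struct.unpack_from("!HHHH", frame, l4)
    return Datagram(dst, dport, frame[l4 + UDP_HLEN:l4 + length])


def build_sample_frame(payload: bytes) -> bytes:
    """Build a test frame (zeroed MACs and checksums, documentation addresses)."""
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    udp = struct.pack("!HHHH", 40000, 53, UDP_HLEN + len(payload), 0)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(udp) + len(payload),
                     0, 0, 64, IPPROTO_UDP, 0,
                     ipaddress.IPv4Address("192.0.2.10").packed,
                     ipaddress.IPv4Address("198.51.100.1").packed)
    return eth + ip + udp + payload
```

With the result in hand, an unprivileged process could pass payload to sendto() on a regular UDP socket with (dst, dport) as the destination, and the kernel would build the outgoing headers itself.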
So great. And why am I doing this? So what's the whole point? No, not because we can. Also because we can, otherwise we wouldn't be doing it. But the important thing here is that I don't have root. I don't have CAP_NET_ADMIN. So if I'm running a container like I just did with Podman and somebody hacks it because they didn't apply security patches or because I'm dumb and I just map ports without authentication or something, well, I have no embarrassing consequences or limited embarrassing consequences. It could be much, much worse.

And let's say, even if nobody does that, we have the safety that what a user can do is just open and connect and bind TCP, UDP, and ICMP ping sockets. So the so-called ping sockets. So that means I have the safety that nobody will spoof packets, because if you can send arbitrary frames, if you can spoof ARP, you are in control of a network essentially. You are telling everybody, this is me and this is him and this is her. So that has serious consequences if you can craft arbitrary frames. And it can be quite fast and also flexible now. So I would start saying that it looks like a good thing.

So what would have been the alternative? The alternative would have been that I went there to my server and I created a container and then I would have done ip add something, created a bridge, did something with netfilter maybe just to drop a bit of the really, really suspicious things like ARP frames that come out with totally random MAC addresses and stuff like that. But that would have involved that I needed root, and I realized there was a talk about that, actually, yesterday.

So I want to also say that this is a limitation for some applications. Maybe you're really running an ARP proxy inside a container. If you want to do that, hey, you need root. But then there is a good reason for it. And OK, that's how we work around it. I mean, I'm appending headers, removing headers in user space. So, are you going to fix it now?
I don't think so. I don't think there is something to fix. So there are reasons why this is not allowed. And I mentioned some of these reasons. So, right, I mean, an unprivileged user shouldn't be in control of the network as well. So this is kind of, there are many ways to divide unprivileged and privileged users. That's pretty much the Linux and BSD way; other operating systems have completely different reasonings. I heard of operating systems running the UI at ring zero, but Linux, luckily, doesn't. So, and OK, let's say, like in the talk yesterday, we do it for them, but are they really unprivileged then? I would start having some doubts.

So I think that if a network namespace has an interface like we did, then we should follow that philosophy. OK, it doesn't have root for a reason, and we should stick to it. Because, yeah, these are the three advantages; we really don't need to even debate whether it has privileges or not.

And actually, appending headers and removing headers is a way of ensuring isolation. We are in control of this, and the kernel forces its own headers on us as we send packets out at the bottom of the diagram, right? So, when we are there on the layer four sockets, the kernel doesn't allow us to put an IP address and the checksum there. No, it says, I do it, I take care of it. I don't trust you. So, that's how you implement isolation.

I cheated a bit in the demo earlier. So, we were actually in the left case, so I didn't do so much in that thing because, yeah, I used a local connection in the demo. It's a bit easier, and in that case, I already have layer four sockets because I am doing a connection with iperf from the container to the host. Well, host is a bit of a misnomer, right? But it's still the same host. It's a different partition of the host, so to say, and there I already have layer four sockets. I can just splice this.
Splice is a system call in Linux that just allows you to move data from a socket to a pipe, and then from that pipe to a socket. And you don't need to do anything special, however. Of course, we are just carrying payload there. So, that means we have no headers, and that means I can just use a loopback interface to do that. So, if we actually want to go to the internet, we need to do this trick that I was mentioning earlier. So, we really need to append headers, remove headers.

Otherwise, so if somebody's familiar with the Podman situation as of a while ago, there was a way to be really fast with this trick and with some amazing tricks. On the other hand, you would lose the IP address from outside. So, all the traffic would look like it was coming from the host, and that's not so convenient if you have applications inside there that need to, for example, authenticate based on IP address, or filter, or route.

So, okay, now we pretty much got it, so this is pasta, okay? That's not the thing you eat anymore. That's something in between. The acronym, I don't even remember it. I mean, if you want, just go to the website and it's written there. So, we got what it does. I just wanted to present a few peculiarities that make it, in my opinion, reasonably safe.

So, we don't do dynamic memory allocation there, and it's funny because we are dealing with packets, so it would be natural to read from a socket, allocate 1,500 bytes or a bit more, and then free them when we have sent it. Now, if we are careful, the kernel already has buffers, so we can use them. We just need to avoid dropping things from kernel queues before it's time to do so. So, when I get a packet from the socket, I need to remember that my container needs to read it, and maybe it loses it. Maybe nobody's reading it, or maybe they are too late, or maybe it's out of the congestion window because I sent too fast, so I need to keep it there. I have no other space. I'm not allocating memory.
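One standard way to read data while leaving it queued in the kernel is the MSG_PEEK flag of recv(). The following Python sketch shows only that behavior, under assumptions: it uses a local socket pair as a stand-in for a real TCP connection, and the function names peek_then_consume and demo are made up for this illustration. It says nothing about how pasta tracks what it has already forwarded.

```python
import socket


def peek_then_consume(sock: socket.socket, nbytes: int):
    """Look at queued data twice without dequeuing it, then consume it."""
    first = sock.recv(nbytes, socket.MSG_PEEK)   # copied out, still in the kernel queue
    second = sock.recv(nbytes, socket.MSG_PEEK)  # the same bytes are returned again
    consumed = sock.recv(len(first))             # only now are they dropped from the queue
    return first, second, consumed


def demo():
    # A connected AF_UNIX stream pair stands in for a TCP socket here.
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"segment-1")
        return peek_then_consume(receiver, 4096)
    finally:
        sender.close()
        receiver.close()
```

The point of the design is that the kernel's receive queue doubles as the retransmission buffer: as long as the data is only peeked, it can be read again later if the other side did not get it.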
So, I can use MSG_PEEK. So, MSG_PEEK is a flag for the recv and recv-like system calls that allows you to, yeah, just look at it, don't drop it. No, let's pretend I didn't read it. And this avoids some classes of memory-related potential security issues like, I don't know, a double free, a heap overflow and stuff like that. It's not completely safe. I can still have stack overflows. It's a bit harder perhaps, and it's a bit easier for my mind actually to keep track of the stuff I'm doing, which is probably an important factor for security.

The TCP adaptation. So, there is some TCP adaptation. This thing needs to keep track of the connections; however, we already have two stacks around. It's actually, in the case of a container, the same kernel and two instances of TCP stacks. So, we don't need to do really much in terms of congestion window and keeping track of metrics and expanding the window, shrinking the window. How much memory do we have? No, we can ask them, what's your congestion window? Yeah, okay, I use it.

It doesn't do NAT if you don't want it. That's why it was so confusing perhaps, but it can also be convenient. So, we had the same addresses inside Podman and outside. Full IPv6 support, but I hope maybe we don't have to mention it in 2023. Yeah, actually, I could have just gone into this pasta config that I showed you earlier and asked for an address via DHCPv6 or DHCP. Did I say pasta? Right, yes, I did.

So, let me cover a bit of this project's history. It's not original at all, it's a big scam. So, Slirp has been doing that for 18 years, actually. Sorry, is it 10 minutes, but? 15? Ah, to the questions, okay, not to the... Yeah, great, thanks. So, Slirp has been doing that for 18 years. I'm really not presenting anything new, probably. We started it for virtual machines. Namely, KubeVirt developers came to my team and asked: we don't want to use Slirp because it has a bad name, but it's really convenient.
Can you do something like that? We want to run our containers with virtual machines, that's KubeVirt, without root. And if possible, we would also like to avoid that. So, we started it for virtual machines and then at some point we realized that containers had exactly the same thing. So much that for QEMU, yeah, this Slirp, Slirp.

So, Slirp comes from the 90s, right? The story is kind of complicated and we'll not cover it for the sake of time, but it was a way, when universities started offering dial-up shell accounts to students or professors, to have actual internet access by tunneling everything you wanted into your dial-up connection that was supposed just to connect to your university server. And then from there, you had routing. So, if you tunnel everything into that, then you could reach whatever you wanted, not just the resources of the university that nobody cared about. So, this is, right, that's an old trick, and somebody had a really brilliant idea in my opinion to use it for QEMU. And then another brilliant idea, for Podman. And there is something similar for Docker, actually. So, there is something already very similar. Also, slirp4netns is like pasta and libslirp is like passt. Both acronyms are explained on the website.

So, yeah, it had a bad name. Well, okay, nowadays you leak 50 bytes in one day and you get a CVE, yeah, fair. However, also performance-wise, it wasn't really meant for the bazillion bits per second that we need to have nowadays. It was meant for, yeah, dial-up, you know, a bit more than Telnet, post-BBS era, I would say, or still BBS era. So, it doesn't support TCP window scaling, which means 64K is all you can send and then, okay, 64K more, that's slow. IPv6 support wasn't really there. So, IPv6 was actually introduced a bit before, but, you know, there was no reason to use it. We still had plenty of IPv4 addresses in the world.
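To put the 64K limit in numbers: without window scaling, a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is capped at window size divided by round-trip time. The short Python calculation below is a back-of-the-envelope illustration only; the function name and the round-trip times are chosen for this example and are not measurements from the talk.

```python
UNSCALED_MAX_WINDOW = 65535  # largest value of the 16-bit TCP window field, in bytes


def max_throughput_bits_per_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


# Example round-trip times, picked arbitrarily for illustration.
for rtt_ms in (1, 10, 50):
    mbit = max_throughput_bits_per_s(UNSCALED_MAX_WINDOW, rtt_ms / 1000) / 1e6
    print(f"RTT {rtt_ms:>2} ms: at most {mbit:.1f} Mbit/s")
```

At a 50 ms round trip this works out to roughly 10 Mbit/s, which is fine for a dial-up era tool and far from what a container or a virtual machine needs today.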
And, right, so that's how pasta was born, and then, with a lot of help from Podman's development team, we shipped the native integration I just showed you in Podman 4.4, which came out this year. Since two days ago, this is now supported in Buildah, so, if you're familiar with it, that's a facility that can build container images in Podman. It's supported by libvirt, KubeVirt, tech preview. We are now very few, but very committed developers and lots of occasional contributors, and Podman users came up with everything possible.

Let me cover recent developments, just in case you followed this project recently. So, somebody said that it's not fast enough. Okay, great. What can we do better? So, this only applies to VMs, so it's not really pasta. We have a UNIX domain socket to QEMU, that means copy to the socket, copy from the socket. QEMU needs to do that and passt needs to do that. Okay, we can just bypass QEMU altogether with vhost-user, and there it's actually faster than, don't quote me on this, because we don't have very nice benchmarks yet, but it's actually faster than whatever you could do with root and privileges.

So, yeah, we now copy all the addresses and the routes, not just one. You saw that I had so many IPv4 addresses there, because we had some problems with cloud environments that some users reported. And WireGuard almost works out of the box, finally. That's a bit complicated, but it looks like a popular use case for Podman users.

So, new use cases. One funny thing somebody came up with is, what if I want to just have throw-away containers, deploy them quickly with their own address, and they don't tell the host anything? Hmm, interesting. With IPv6, it's actually the case. Sometimes you have a /64, so yeah, it's actually possible. So, in that case, pasta would just bind an address that the host doesn't have, that nobody has ever seen, except for the prefix.
So, you can just have a completely separate container with its own address and assign it with an NDP responder that's built into pasta. So, we advertise the prefix and nothing else, we assign a pseudo-random MAC address, we take care of it, we keep track of it, and we could actually deploy a lot of containers without really knowing much about the network or really not thinking much about it. And some people already use macvlan for it, but it takes a few tricks to set up and it's not rootless.

Another one, less funny, maybe less visionary, but this starts being important. You have IPv4 applications, maybe you don't have the source code for them, but they are IPv4, and you have an IPv6-only setup. Maybe somewhere in central Europe this starts being a problem, you start paying quite a lot for an IPv4 address. So, there are RFCs for this and pasta could be a good candidate. You can find more details in the bug reports there.

Another one we really need to take care of now is, how did I do port forwarding there? You didn't even see it because I didn't do it explicitly, it was automatic, and you can do it with Podman configuration options and everything, but for Docker, actually, they really need it via RootlessKit. And also for Podman custom networks, it would be really nice if Podman could happily decide, without stopping or restarting the container or pasta itself, actually, to map a new port. And with that we are going towards a general flow table. We don't want to implement Open vSwitch, but we are dangerously close to it.

Start preparing your questions now. So, if you want to try this out, please do, and please especially report bugs. We have mailing lists, we are a bit old school, but don't be afraid to mess up. We understand that many people are not used to patch-based or email-based workflows. There is an IRC channel on Libera, it's #passt. There are weekly meetings that are open to everybody, if you even just want to listen to the dramas we have.
It's kind of quite funny, and all this information, you'll find it at passt.top.

Credits, yeah, so David has been reworking a lot of stuff recently. Laurent is taking care of the bazillion bytes per second there. Paul from Podman's development team has got it into Buildah this week, and he always helps a lot. Lami is a packager for Void Linux, and he presented a lot of new use cases, landing the libvirt integration, and Yoni's making the kernel nicer to us, or trying to. And we have a lot of, really, a lot of contributors and packagers, so packages are available for Arch, Debian, Ubuntu, Fedora, RHEL, a few more probably.

Questions. So the question from Paul is whether, when I was referring to the cheating case where I just have local traffic, I'm referring to RootlessKit. Yes, yes, okay. So does RootlessKit allow, is RootlessKit the part that allows us to skip the setup to just create a bridge to the host? No, actually, so RootlessKit, what it does, okay, RootlessKit is a pile of things. In this case, what RootlessKit does is the trick of copying packets back and forth, not with splice, they use recvmsg, sendmsg, it's not very different, but that's the part that it does. And you actually have it as the default port forwarder for Podman because it's faster, but you don't preserve the source IP address. So that's the part it does. Yeah.

So the first question of two is, if this can be used with Podman network commands. So as far as I know, yes, since Thursday. But you probably know better than me. And if you don't, let's check it out. But I think Paul just did this, because I just used it with Podman run as you've seen, and I think now finally the code has been moved to the proper place so we can actually use it for Podman network. I think that's my understanding, yeah.

So the second question is, when will pasta become the default in Podman? You suggested for the six, right? And I hope we are on track. Yeah, might be tight, but yeah.
Right, so the question is, I showed that I would connect to localhost and that connects to the host, and maybe I really want to connect to localhost as localhost. Yes. And yeah, actually one of the reasons why we are reworking the NAT model is to allow that. Right now we just allow you to disable this functionality, so to map nothing, or to map everything, or to map, like, the address of the default gateway to the host, but it is not very flexible. So we are actually adding more options. And what the default is, that remains to be seen. We need to check with many people what they really expect, because if you come from VMs, you have different expectations, you know. Right, yeah. I can check on metrics. Okay, thank you.