Okay, so let's start. Hello, welcome, great to be at a FreeBSD con... pardon, EuroBSDCon, it's all of the BSDs; we definitely missed that last year. Yeah, too bad we cannot meet in person, I would really have loved that. But let's start with my talk. We have gathered some experience in running lots of jails on lots of machines, and I'm going to talk about the networking setup that we developed, the issues we have with it, and our future plans.

For people who don't know me yet, a short introduction first; sorry, I'm a bit out of practice with presentations. So: a short introduction of the presenter, then our current architecture, how we went IPv6-only for jails (because you have probably heard that there is a depletion of IPv4 address space), what challenges we face, specifically on layer 2, and why I think that these challenges are rooted in Ethernet no longer being a proper paradigm for networking in the 21st century. We'll see.

I have been working in IT since 1986, started with Minix, you know, and have been a FreeBSD user since 1993. I manage the network and our data center operations at punkt.de. punkt.de is a hosting company and a software development company. I'm in the operations team; we have three operators and one guy who is our Python wizard, coming from the development side and getting into operations while teaching us operators how to write proper code and use agile methods, use git, use Ansible, all of this stuff. punkt.de was founded in 1996, is a RIPE member and a DENIC member, and has a couple of web application development teams and, as I said, one operations team.

We use jails for hosting, so customers that want to host with us get the illusion of a root server that is located inside a FreeBSD jail, and we maintain all the software. If you're not familiar with such a setup or with how FreeBSD jails work, my presentation from EuroBSDCon in Paris is still up on YouTube, so you could look into the architecture that we designed for hosting applications, from the operating system and file system layout point of view and so on, how we
do updates, and all that stuff. Today is a networking day for me, so I treat that as a given: we're running jails.

Jails, for a couple of years now, have the VIMAGE or VNET feature in FreeBSD, which is a completely virtualized network stack, so you can have private interfaces, private firewall rules, private everything inside a jail, to complete the illusion of a virtual machine, while the jails still run on a single kernel and are, compared to a hypervisor, very lightweight: very fast to launch, tear down, provision, etc. The epair interface that we use is a virtual patch cable, one end inside the jail and one end on the host system, and the general setup is to bridge this to the wire by creating a bridge interface that contains the physical interface of the host and the jails' epair interfaces.

Let me show the architecture in a graphical way. There we are. The blue box is the host. The host has got the network interface that is connected to some switch, and then we create a bridge interface that contains the physical interface and all the jails. All the jails have the illusion of being connected to the regular network and run their own IP stack via VNET; they can do their own routing, their own firewall rules, as I said. So that all looks pretty great and scales quite well, for certain values of "quite well".

The view inside such a jail is like this. The interfaces are called epair0 for the first one, epair1, etc., and there are two ends to each epair interface: epair0a would be outside of the jail, on the host (this is the one that we put into the bridge as a member), and epair0b is the one that is inside the jail. As you can see, it's got an inet6 link-local address, it's got a global unicast address, it's got an IPv4 address, and the IPv4 address comes with a netmask of /24. So it's all really the illusion of a complete, real host to our customers, and they are quite happy with the product. We have one drawback at the moment: we're sometimes reaching performance
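limits, which I'll come back to in a moment. First, sketched as configuration, the classic bridged setup looks roughly like this. This is an illustration, not our exact production files: the NIC name igb0, the jail name, and the jail.conf fragment are assumptions.

```shell
# Bridged VNET setup, sketched; interface and jail names are examples.
ifconfig bridge0 create
ifconfig bridge0 addm igb0 up

# /etc/jail.conf fragment: a VNET jail whose epair "a" end joins the bridge.
#   vpro0001 {
#       vnet;
#       vnet.interface = "epair0b";
#       exec.prestart  = "ifconfig epair0 create && ifconfig bridge0 addm epair0a up";
#       exec.poststop  = "ifconfig epair0a destroy";
#   }
```

So, as I said: we are sometimes reaching performance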
limits in FreeBSD 12. The bridge interface that we use used to be the bottleneck, being only single-threaded. Thanks to Kristof, who I saw is listening, for improving that by a vast amount, one order of magnitude; I remember a factor of five or even more, by making the bridge interface multi-threaded so it can run on multiple cores. So from that point of view the bridge is not a problem anymore, but we will get to other drawbacks that we're facing later.

We rolled this out four or even five years ago. We have on the order of a couple of hundred to slightly under a thousand jails running on 80 or 90 physical servers in total. And the first major change compared to what I laid out so far is that we went IPv6-only, because we cannot afford to provision an IPv4 address for every single jail anymore. The standard product that customers get now is IPv6-only, and you can get an IPv4 address for your jail for an additional fee. But of course we make sure that all the applications that our customers want to run are reachable via IPv6 and IPv4 all the same.

So how do we do this? We have the same layer 2 architecture, with a bridge and everything, only that the jails don't have an IPv4 address, and we added a jail that we maintain which runs a gate64 setup. It serves as a gateway for all the jails to reach IPv4 targets, and as a reverse proxy for customers who are on IPv4 and want to reach their jails. The egress part works with NAT64: there is a special address range in IPv6 networking that is reserved for this purpose. We route this address range through the gate64 jail, and the gate64 jail does the NAT with ipfw, the IP firewall that is native to FreeBSD. And for all of this to work, the resolver needs to lie about AAAA records. I'll show that in a short demo. Let's see... for those of you possibly in their 40s or 50s, like I am, and a bit vision impaired: I hope this is large enough for everyone to see. Yeah, great. So this is inside a
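jail, which you'll see in a second. Before the demo, here is roughly the shape of what the gate64 jail runs. This is a sketch: rule numbers, the NAT64 instance name, prefixes, the public IPv4 address, and the unbound fragment are illustrative assumptions; check ipfw(8) for the exact syntax of your release.

```shell
# Stateful NAT64 with ipfw, using the well-known prefix 64:ff9b::/96.
kldload ipfw_nat64
ipfw nat64lsn n64 create prefix4 192.0.2.1/32 prefix6 64:ff9b::/96
ipfw add 100 nat64lsn n64 ip from any to 64:ff9b::/96 in   # v6 clients going out
ipfw add 200 nat64lsn n64 ip from any to 192.0.2.1 in      # v4 replies coming back

# The "lying" resolver: unbound's DNS64 module synthesizes AAAA records
# inside the NAT64 prefix whenever a name has only an A record.
# unbound.conf fragment:
#   server:
#       module-config: "dns64 validator iterator"
#       dns64-prefix: 64:ff9b::/96
```

Okay, the demo. So this is inside a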
jail. The product is called "pro server", so all the jails are called virtual pro servers; that's why it's "vpro" and then just numbers. Cattle, not pets. The external interface, as you can see here, has got only IPv6 configured. Now when we want to ping a host that is only reachable via IPv4, this does not work, "network is unreachable", because we don't have any IPv4 connectivity. But when we ask for a AAAA record, which github.com does not have, then our name server fakes this address. Fake, because this is the special address range used for NAT64, and this is just the IPv4 address that we got in the last request over here, encoded in hex. There's even a special syntax: you can write NAT64 addresses like this. And this way all our IPv6-only pro servers can reach destinations on the internet that only have IPv4. They ask for a AAAA record, because they run a strictly IPv6 stack; the name server lies and gives them a NAT64 address; and the NAT64 gateway takes care of translating from IPv6 to IPv4 and vice versa.

So that's the egress part, and then of course we need an ingress part for applications. We run an SNI proxy. SNI is short for Server Name Indication; this is an extension to the TLS/SSL standard. The problem with SSL as originally designed was that the HTTPS traffic was encrypted right away: you go directly into the key exchange, certificate exchange, certificate check stuff between the web browser and the web server, and you cannot indicate a host header before going into the cryptographic exchange. Which led to the undesirable situation that you needed to know which certificate to use on the server side up front, and so you needed essentially one IP address per SSL-enabled web server. This was changed with SNI: the browser now tells the server which host it wants to connect to, then the web server can pick the proper certificate, and then the key exchange and everything starts. SNI works not only with
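https, as we'll see in a moment. But first a quick aside to make the address synthesis concrete. This is a little POSIX shell illustration of the encoding, not our tooling; the IPv4 address is an example from a documentation range, and 64:ff9b::/96 is the well-known NAT64 prefix from RFC 6052 (a resolver could just as well use a local prefix).

```shell
# Synthesize a NAT64 address: append the four IPv4 octets, in hex,
# to the 96-bit NAT64 prefix. 192.0.2.128 is an example A-record result.
ip4="192.0.2.128"
oldIFS=$IFS; IFS=.; set -- $ip4; IFS=$oldIFS     # split into the four octets
nat64=$(printf '64:ff9b::%02x%02x:%02x%02x' "$1" "$2" "$3" "$4")
echo "$nat64"                                    # 64:ff9b::c000:0280
# The same address in the dotted-quad spelling: 64:ff9b::192.0.2.128
```

So, as I said: SNI works not only with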
https, although the reason for it to come into existence is HTTPS; it works also with plain HTTP, which is important in our setup because we need HTTP for Let's Encrypt. The AAAA record of, say, www.customer.de points to the jail proper, to the IPv6 address that the jail has got natively, and the A record points to the gate64 jail. What then happens is that the browser, if it has only IPv4 available, connects to the SNI proxy, indicating via SNI the host name for which a certain web page is to be served. The proxy then looks up the IPv6 target address via regular DNS and checks whether it is in the permitted range, i.e. in our data center; otherwise you could use our proxies to anonymously connect to any target on the internet. But if it's on the same host, or in our data center, then the request is permitted, and the SNI proxy connects via TCP, not via HTTP proxying but via a particular TCP proxy protocol, to the web server via IPv6. Then the regular key exchange and certificate negotiation starts, and you get a proper HTTPS connection with encryption and everything that you want.

The most common problem that customers face with this setup is that they forget to set an AAAA record, because most corporate and enterprise people out there still think only in IPv4, even today. But generally, with our support, this works really well for web requests. All the applications are available via v4 and v6; we run Nextcloud and other complex stuff, including WebSockets, through this setup, and we don't face any problems.

For other applications, things don't look quite as good. We have native IPv6, of course, for SSH, or a jump host that customers can configure; and if you want to access your MySQL database, your Elastic, whatever, you need to use an SSH tunnel if you're only IPv4-connected, because we have no host-name-based proxying like we have with the SNI proxy for all these applications. And we are not sure how we are going to handle QUIC, the new HTTP successor protocol designed at Google.
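In DNS terms, the dual-stack ingress boils down to a zone fragment like this; the name and the addresses are examples from documentation ranges, not real customer data.

```
; zone file fragment (sketch)
www.customer.de.   IN AAAA  2001:db8:42::1234   ; straight to the jail, native IPv6
www.customer.de.   IN A     192.0.2.80          ; to the gate64 jail running the SNI proxy
```

An IPv6 client connects directly to the jail; an IPv4-only client lands on the SNI proxy, which dispatches on the indicated host name.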
HTTP/2 is already enabled across all our servers and works really well, but QUIC? Sorry, no clue whether sniproxy intends to implement that, or whether the protocol even lets you do SNI-based dispatch; I don't know at the current point in time.

So that's the IPv4 address depletion topic, and I'm quite happy with the solution we developed. Let's switch to the next topic, which is the challenges we have with our network infrastructure. I'll link this bug, which was the first indication that something we do might not be quite optimal. This is not to blame anyone; bugs do happen, and Kristof and Bjoern have been a really, really great help to finally track this down and fix it. The fun part about this bug is that it only happened in our data center, in production; nobody else was seeing it. Specifically, Kristof and Bjoern could not reproduce it on their development machines or in test setups, absolutely no way.

So what's happening? What we saw was that, for some reason, an entire pro server host, a physical machine, stopped forwarding traffic to all of the jails that were hosted on that machine, all at once. No traffic in, no traffic out. And this happened every couple of weeks, then every couple of days; it happened more frequently as our data center grew and we provisioned more hosts and more jails. The nature of the bug is that the epair interface has, as a real hardware Ethernet interface would, transmit and receive queues where you can queue up packets, and due to the bug, whenever that queue filled up to the brim, the interface stopped forwarding packets. This is hard to reproduce if you have a single connection and a single development machine, because even with a one-gigabit TCP stream running at full speed you will not fill up the interface queue. So what's different in production? What's different in production is that we have broadcasts, and we have lots of them. Of all the packets that a host sends and receives, including all the jails running on the host of course, 40 percent of all the packets
are broadcast or multicast packets; only 60 percent is traffic directed at a unicast address. Which is quite a mouthful. So what happened is that you have, for some reason, a bit more of this broadcast traffic; it is received by the bridge interface on the host, and then, because it's broadcast, it is simultaneously forwarded to all of the epair interfaces of all of the jails, filling up the queues. Once the queue is full, the interface stops working.

So what to do about it? As I said, Bjoern was really helpful and told me how to increase the queue length; there's a sysctl parameter that can do that, which is not very well documented. Then of course you can fix the bug that makes the interface freeze; I'm not quite sure if the final rework of the epair is already in the source tree, but the bug proper seems to be fixed. But of course this entire situation is not really optimal. We have 40 percent broadcast packets, and seriously, folks, this just sucks.

In 2010 or so I attended a conference in Cologne with a focus on OpenStack, and one of the companies that deployed OpenStack, the private-cloud infrastructure open source product, told us how they first naively deployed OpenStack for their customers, and as soon as they had a couple of hundred customers with VMs running on this OpenStack cluster, suddenly the router connected to that network crashed, because the ARP cache of that router was just overflowing, as was the neighbor discovery cache for IPv6. So if we want to communicate over Ethernet in such a large network, with lots of virtual machines and lots of jails, commonly called a broadcast domain, we need broadcast packets to do address resolution. So how do we get rid of them?

Okay, first step: move the bridge off the wire. There is actually no need for the bridge interface that connects all the jails and the gate64 to be connected on layer 2 to the host interface and the local area network. Again I can show you what that looks like; in production we have it running. This is
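a pro server host, which I'll walk through in a second. First, the mitigation in concrete terms. The sysctl name and the value are what I believe applies to the FreeBSD 12-era epair; treat them as an assumption and check `sysctl -d` on your version.

```shell
# Enlarge the epair queues so broadcast bursts do not wedge the interface.
sysctl net.link.epair.netisr_maxqlen=2048
# Persist across reboots:
echo 'net.link.epair.netisr_maxqlen=2048' >> /etc/sysctl.conf
```

So, as I said, in production we have it running; this is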
a pro server host with that setup. If you look at the interfaces of that host, we have the physical interface; we rename all of our interfaces so that we have the same names for external connections and everything else across different hardware variants. This is an igb, I think, and we rename all the external interfaces of the hosts to inet0. This host is not in our data center; it's located at Hetzner, in Nuremberg. They give us a single IPv4 address for that host, and they give us a /64 for IPv6, and they statically route this address and this /64 to the MAC address that the host has got.

Now, to host virtual pro servers, i.e. jails, we create yet another bridge, of course, but this bridge, called jail0, is not connected to any physical interface. This host is running only one jail at the moment, and as you can see, the bridge has only one member interface, which is our jail. If you pay for them, you get additional IPv4 addresses for hosting, in the regular routable sizes, so with all the pro server hosts we order a /29 from Hetzner, and then we can flexibly offer customers either a dedicated IPv4 address or our gate64 NAT64 setup, as I just showed. In this case we have a dual-stack pro server: we have one address out of this /29 assigned to the bridge, and this address serves as the default gateway for the jail. So there is no need to connect the bridge to the local area network and fetch all the broadcasts that are floating around there; we only have broadcasts if there are multiple jails on the bridge proper, which is not as much of a problem as putting a couple of hundred jails on a single wire. And of course the host itself is acting as a gateway in this case, so forwarding for IPv6 and IPv4 is enabled. As I said, Hetzner is giving us a /64 for the entire host and as many jails as we like, so as with IPv4, for IPv6 we have one address, with the regular prefix length of 64, on this bridge
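interface. Putting the pieces together, the rc.conf of such a routed host could look roughly like this; the addresses are documentation-range stand-ins for the real Hetzner assignments, and the variable layout is a sketch, not a copy of our configuration.

```shell
# /etc/rc.conf fragment (sketch) for a routed pro server host.
cloned_interfaces="bridge0"
ifconfig_bridge0_name="jail0"                 # the jail bridge, no physical member
ifconfig_jail0="inet 192.0.2.9/29"            # first address of the routed /29,
                                              # default gateway for the jails
ifconfig_jail0_ipv6="inet6 2001:db8:1::2 prefixlen 64"
gateway_enable="YES"                          # host forwards IPv4 ...
ipv6_gateway_enable="YES"                     # ... and IPv6
```

Again: one address with the regular prefix length of 64 sits on this bridge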
interface, and again, all the jails use this router as their default gateway. And additionally, and this is the first time I show a neat trick that we are planning to use with v4 and v6 alike in a future development: you can actually split a net, or assign single addresses to other interfaces, if you use a prefix length of /32 or /128, depending on whether you're talking v4 or v6. So this is the same prefix that Hetzner assigns: we use the first address for the external interface of the host, with a prefix length of 128, so we can reach the host via IPv6; and due to this prefix length, we can easily assign the rest of the network, the /64, to our bridge interface and use it for jails.

We have come to randomly generate IPv6 addresses, because a /64 is really, really large. For listeners who might not be that familiar with IPv6: a /64 means it's the entire old IPv4 internet, squared. So that's really a heck of a lot of potential hosts, and if you use random addresses for your hosts and rely on DNS, that means that nobody can linearly scan your network for active hosts to attack. Which is of course security by obscurity, but in this case I think it works. If you just start at one and then go to two, three, four, five, people can scan your network for hosts and will find active machines; in our case, all of the IP addresses are random, which is definitely an advantage here.

But this setup, I think, shows that routing might be a good idea, because if we use layer 3 instead of layer 2, we might be able to solve the jail/VM mobility problem. If I want to move a jail from one host to another one, I can at the moment only move it to a host which has an interface in the same VLAN, because all the addressing is tied to a certain layer 2 broadcast domain. And I might want to move jails around, have them on central storage, do high availability: if one host crashes, just fire up its jails on another host. So yeah, that's not optimal, and maybe getting all this layer
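two stuff out of the way is where this is heading; I'll argue that next. First, the /128 trick and the random addresses in concrete terms. This is a sketch: the prefix is a documentation stand-in, the ifconfig lines are shown as comments, and the little generator only illustrates the idea, it is not our provisioning code.

```shell
# The host itself gets a /128 out of the delegated /64; the rest of the /64
# goes on the jail bridge:
#   ifconfig inet0 inet6 2001:db8:1::1 prefixlen 128
#   ifconfig jail0 inet6 2001:db8:1::2 prefixlen 64

# Pick a random 64-bit host part for a new jail, so nobody can scan the
# /64 linearly (POSIX sh):
suffix=$(od -An -N8 -tx2 /dev/urandom | tr -s ' ' ':')
addr="2001:db8:1:0${suffix}"
echo "$addr"
```

So, as I said: maybe getting all this layer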
two stuff, all this Ethernet stuff, out of the way might be a good idea after all. And all the big guys already do it: the cloud providers like Amazon, and the hyperscalers who run their own infrastructure, like Facebook, all have layer 3 addressing and announce routes for their containers, VMs, whatever they are using, be it Docker, be it KVM. They announce these routes via BGP internally, so any instance, a jail in our case, can move to an arbitrary host, and the routing just takes care of reachability. The downside, as some of my colleagues would feel about it: you need a dynamic routing protocol daemon on each host, like OpenBGPD. I, being traditionally a network and Cisco and hardware guy, have absolutely no problem with this line of thinking; my dear colleague Wolfgang doesn't like the idea, but we have to talk about it in a constructive way, and I think this is a necessary way to go for us.

Next: why do I want to get rid of Ethernet? What's the problem with Ethernet, and why do I want to have only layer 3 routing? Ethernet is this; that's what you think of when I say Ethernet: you think of switches, you think of Cat 6, whatever, twisted pair cable. But all our protocols pretend it's still this. So what is this? This is a yellow cable with a vampire transceiver. A little bit of a history lesson: this is the original Ethernet, and the original Ethernet was a coaxial cable, which is more or less a medium for radio waves, and all stations were connected to this coaxial cable via these vampire transceivers. Any packet sent by any station on the net was received simultaneously, well, more or less, we have to consider the speed of light, by all stations that are listening. So this is a broadcast medium, every packet goes to all stations, and that's why our protocols are designed the way they are: all stations are listening, and the network interface just filters out and forwards to the operating system the frames that carry the proper MAC address. And that's why we
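need hardware addresses; hold that thought. The per-host BGP idea from a minute ago, sketched as an OpenBGPD configuration; the AS number, addresses, and the neighbor are invented for illustration.

```
# /usr/local/etc/bgpd.conf fragment (sketch): each jail host announces the
# prefixes of its local jails to the upstream router.
AS 64512
router-id 192.0.2.9

network 192.0.2.8/29            # IPv4 jail prefix currently on this host
network 2001:db8:1::/64         # IPv6 jail prefix currently on this host

neighbor 192.0.2.1 {
        remote-as 64512
        descr "upstream router"
}
```

Move a jail, announce its prefix from the new host, and reachability follows. Anyway, back to the history lesson: that's why we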
need hardware addresses on such a medium. And, yeah, sorry, period.

But this is not the reality we have nowadays. The reality, when we have this, is that each of these links is not a broadcast medium at all: it's all point-to-point, it's all full duplex, it's all with flow control on the links. Defective cables or ports notwithstanding, on the cable there is never a single bit lost; packets get dropped when forwarded by the switch, when the forwarding queue of the switch is exhausted, or when the receive queue of a station's interface, i.e. a host and its operating system, is exhausted. These links have perfect forwarding properties, like any serial link that you might know. And actually, the internet was originally developed on serial links only: we had those Interface Message Processors and the wide area links between universities; all this Ethernet stuff came in rather late. So the switches actively work to forward all the broadcasts, and in my opinion the MAC addresses, the layer 2 addresses, the necessity of MAC addresses, and hence ARP and NDP, are a relic of the past. The switch, and most switches have layer 3 capability nowadays anyway, could just as well use IP addresses and know which host is on which interface; that would be perfectly possible. Unfortunately, our current stack does not work this way. And if we had an interface working like this, we would immediately solve a whole number of problems.

So my question for the development community, which I would really love to discuss with Kristof later: can we have a VNET point-to-point interface, please? Because on point-to-point interfaces you don't need addresses on the link: serial links can be run completely without IP addresses on the links, you just set routes to the interface. And you don't need transfer networks, if you want to conserve IP addresses, so we have no leaking RFC 1918 addresses in traceroutes, and you don't have ARP or NDP cache depletion attacks like the hosting
company that deployed a too-large broadcast domain with their initial OpenStack efforts. So all these problems vanish the moment we abandon Ethernet, which is of course not going to happen in general. But VNET is already a completely virtualized environment, our jails are a virtualized environment, and there is really no fundamental necessity to emulate Ethernet to connect a jail. If we could have a point-to-point link, comparable to epair, that we can use to connect a host to a jail and vice versa, and just ditch the bridge and do everything on layer 3, everything with routing, I think that would improve matters quite a bit in terms of scalability, resilience, and jail mobility.

I'm currently actively working on a solution that we can deploy in the meantime. If you use a /32 netmask for IPv4 and a /128 for IPv6, you can actually reuse IP addresses on as many interfaces as you like. So my idea is to use this addressing scheme and directly address the interfaces connected to the jails, all without a bridge. I'll show how that's supposed to work in a couple of minutes. This is the desired setup: we have /32 addresses, no bridge interface, and the host is just configured as a router.

I've tried to do this with the current jail tools. Now, just a sec, where are we... iocage, for example, will not, as it currently stands, work this way. But as you can see here, I assigned an address completely out of the range of my local network. On this FreeBSD system with iocage I have a couple of jails, and they have the regular setup, all connected to the bridge, and all the jails' VNET interfaces are members of the bridge, as you can see. But for this single jail, for this short demonstration, I removed the jail interface from the bridge and assigned an IP address completely outside of the range I have here on my local area network. So inside the jail we have this single IP address with a /32 netmask, and the routing table looks like this: the default gateway does not point to
a particular IP address; instead, it points to an interface. You can do things like this in FreeBSD already, right now, like route... sorry, I'm a little bit puzzled about what's happening in my network currently; I see you all typing away, so my uplink and my network in general are working. Maybe I messed up something with my experiments here, but nonetheless... okay, I already showed you that I have an interface route for this thing. And then on the host I did the following... oh my dear god, I still have a VPN connection active. The wonders of the internet: we have good sound and video, and I'm routing all the packets via the USA. So, back again, I hope; let's restart the screen sharing... sorry, I cannot seem to get the screen sharing to work again, which is bad. Oh darn. I would have loved to show you the last couple of slides, but the live demo would really not have gone any farther than showing that I'm able to ping the jail this way, via static routing; that was the last thing I wanted to show. I already reconnected to the conference system; I just did a refresh, that's why the camera stopped working; I reauthorized the camera and tried to reauthorize the screen sharing, but I was at the penultimate slide anyway. So in my opinion we can jump right into questions and discussion. All of this took me longer than I expected, really, so we don't have much room, but I will of course be available in the hallway track for the rest of the day and tomorrow, and we can chat if you like. So thank you very much for listening so far.

Someone in the chat argued that you should never point default routes at an IP address. Well, how else do you do it in the case of Ethernet? If you have 50 hosts on a /24, they all point to the upstream gateway via IP; there is no way to do this with an interface route. That's why I'm arguing that I want a point-to-point interface instead of Ethernet, so I can use interface routes for all
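this, exactly. Since the screen sharing died, here is the bridgeless experiment written out as commands; addresses and interface names are examples from documentation ranges, and note the caveat at the end.

```shell
# Bridgeless /32 experiment (sketch).

# On the host: a /32 on the epair "a" end, and an interface route for the
# jail's address instead of a bridge membership.
#   ifconfig epair0a inet 192.0.2.1/32 up
#   route add 203.0.113.10/32 -interface epair0a

# Inside the jail: a /32 on the "b" end, default route to the interface.
#   ifconfig epair0b inet 203.0.113.10/32 up
#   route add default -interface epair0b

# Caveat: epair is still Ethernet, so the jail will ARP for destinations
# over that interface route and the host has to answer (e.g. proxy ARP).
# A true point-to-point VNET interface would make that unnecessary.
```

Which brings me back to wanting interface routes for all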
this. That's it; that's exactly the point. Any other questions? Yeah, sure, of course you can hardwire the MAC address. Do we preserve the layer 3 and layer 4 addresses with the HAProxy PROXY protocol? Yes, we do: via the PROXY protocol we preserve the real client addresses on the outside, so you can use blacklists and everything. I don't understand the question... about IPv6 enumeration: you said that you generate random host IDs, yeah, and if your customers are expected to use something like Let's Encrypt, their host names will appear in the certificate transparency log, which every attacker can consume. Yeah, but that's an additional step; with IPv4 we just see sweeps of the entire network 24/7, and to make this unfeasible for IPv6 we use sparse address space by scattering all the hosts randomly over the /64. I still think it's a good idea.

Yes, jail addresses need to be stable; that's exactly the point, or the problem, with the mobility requirement. If we want mobility for jails from host to host, we must preserve IPv4 and IPv6 addresses, because we do not know which DNS entries the customer has pointing to a jail. If this were all cloud infrastructure with central ingress and everything, we could of course just renumber the jails, but the jails are rented by customers as full-stack virtual machines, and many of them manage their own DNS and point arbitrary names at the jails, names that we simply do not know.

No, we don't assign a separate /64 for each customer; currently we have one VLAN for v4 and for v6, and we have one /64 and one IPv6 address per jail. How do we handle abuse inside the /64? What do you mean by abuse, faking of addresses? The jails cannot change their address after they are booted up; they're just locked in terms of addressing. Okay, outgoing SMTP spam: well, we simply don't... if we receive an abuse report, we take appropriate measures, but we have only business-to-business customers. We
have no self-provisioning portal; you cannot just click up a jail in our infrastructure and start spamming away. You need to have our sales department contact you: written offer, contract, order, stuff like that, and then the operating team will provision your jail. So we know who our customers are, which takes care of the abuse question.

No, this is not unusual broadcast traffic; this is just a VLAN with 20 or 30 hosts, and all that NDP and ARP and all the other stuff, plus people scanning our network from outside, of course, which generates ARP requests for hosts that are not even online, but they go out via broadcast to all the jails that are connected. This is not yet a real problem; it's just striking how much broadcast traffic is present on a network.

A VLAN per customer? Well, a VLAN per customer does not scale if you want to have hundreds or even thousands of them. VXLAN lacks in terms of security, and how do we connect the VXLANs to anything that is physical infrastructure? At the moment, in the FreeBSD model, we have this bridge thing again; that's why I'm asking whether we can have a point-to-point VNET interface that I can use to route anything over.

Okay, I think I've tried my best to answer all of the questions. I do not claim that I have answers to all the problems; I just wanted to show you what we are currently doing and where we think this entire setup should be going. We will definitely put further development effort into this /32 and /128 epair setup. I'm not quite sure whether we should enhance iocage to the point that it can handle things like that; iocage just makes many, many hard-coded assumptions about what your network topology looks like, which was great when we started, that's why we picked iocage, it just worked. But, for example, you cannot fire up a VNET jail with iocage without iocage creating a bridge interface, if you do not already have one preconfigured. So if we really want to go that route, we either have
to patch iocage, and of course submit pull requests, or switch the tooling. I'm planning to look into Bastille, because it looks really good, with a much cleaner architecture; there is much ad-hockery in iocage, unfortunately, and things cast in code that should not be in code but in configuration.

So that's it. Thank you very much, I hope you enjoyed it, and as I said, I will be available for a chat later. I myself intend to chat a bit with Kristof, if he is present and if I can get hold of him, and with everyone else who is interested in discussing networking, in whichever corner of the hallway track. Someone suggests I look into CBSD; I'm planning to look into Bastille so far. As for the hallway track, if I remember correctly from my quick check yesterday, there are separate areas for the various BSDs, so let's meet in the FreeBSD corner. I'll just restart my network and my browser so I get a clean connection and webcam and everything. Again, thanks.