So, good morning everyone. I'm glad to see that you at least made it to the second talk of the day and aren't still in bed. Before I start, I quickly want you to raise your hands: who has ever heard of Kubernetes before? Whoa, okay, that's a lot more than expected, so I can just skip ten slides. No, just kidding. Who here has heard of FreeBSD? Well, of course, everybody. Who here has heard of CloudABI? Okay, still half the hands raised, that's pretty impressive. Oh, okay, I'll speak closer to the microphone; I shouldn't swallow it, though.

So, my name is Ed Schouten. I've been a FreeBSD developer since 2008. Initially I worked on the TTY layer; people nowadays still send me bug reports hoping I'll fix things there, but I'm too busy these days, sorry about that. Nowadays I'm focusing more on security, cluster computing, et cetera, which is why I'm giving today's talk.

The title of the talk is also going to be my outline for today; I didn't make a separate outline slide. First I'm going to talk about Kubernetes and give a short introduction to what it is. Then I'm going to talk about how Kubernetes is related to FreeBSD, or how we could get it running on FreeBSD. Later on I'm going to talk about CloudABI, how I'm throwing all three of these components into the mix, and why we should be doing this in the first place.

So let's first start off with Kubernetes. Kubernetes is a cluster management system originally developed by Google, inspired by their proprietary cluster management system called Borg, which I used a lot in the past while I was working for them in Munich. The main difference between Borg and Kubernetes is that Kubernetes is written in Go, whereas Borg is written in C++. That was done on purpose: there were a lot of design flaws in Borg, things that could have been improved, so by picking a different language they were essentially forced to rewrite all of it, cherry-picking the things that were good and leaving out the things that were bad.

Google wrote this, released it as open source on GitHub, and later donated it to the Cloud Native Computing Foundation. The CNCF is a branch of the Linux Foundation where projects related to cloud and cluster computing end up. Other projects that are part of the CNCF are Prometheus, a monitoring system which I think is pretty awesome and use on a day-to-day basis, and another pretty popular one is gRPC, Google's RPC framework built on top of protobuf. gRPC is also a reimplementation of a proprietary system they had internally, called Stubby, but Stubby was too hard to open source because it was tied too tightly to a lot of other Google infrastructure, so they took a clean-slate approach and donated the result to the CNCF. Almost all of the projects that are part of the CNCF, I think all of them, are Apache 2.0 licensed, so that's pretty favorable for us, of course.
I mean, it's not a BSD license, but still better than the GPL in my opinion.

To explain the model behind a Kubernetes cluster, you have to get used to the terminology that Kubernetes uses in order to understand properly how it works. One of the phrases you often hear when people talk about Kubernetes is nodes. Nodes are just Linux servers, your average Linux installation. It doesn't matter whether it's a virtual server or a physical server, whether it's running on AWS, on Google's own cloud platform, or on your own hardware in your basement: as long as it's running a Linux kernel, it's a node.

On top of those nodes you want to be able to run containers, Docker containers in the case of Kubernetes. A container, in my opinion, can be defined as a group of Unix processes that share the same process and file system namespaces. Processes running in a container can all send signals to each other, they can store files somewhere in the file system that become visible to other processes running in the same container, and so on.

Kubernetes adds one layer on top of containers, called pods. Pods are groups of containers that need to be scheduled together on one node, so they're the smallest thing that can be scheduled on a cluster. It's impossible to start just a single container on the cluster: you always have to create a pod running one or more containers (not zero containers, that would be pretty useless). Every pod gets its own RFC 1918 IPv4 address, so 10.0-point-something; you can configure which range is used. All of the processes running in the containers in that pod have to make use of that single IP address, so you can't have two containers that both listen on port 80; they have to listen on different ports.

Then, when you do cluster computing you often want to start a whole bunch of pods that are pretty much identical to each other. For example, you've got a web application that you want to spin up 10, 20 times, maybe a hundred times, maybe 10,000 times, and all of them need to be based on the same pod template. This is what Kubernetes calls a deployment. A deployment is literally just a template for what a pod should look like, and then you tell Kubernetes "I want a hundred thousand of those". If you only have a small number of servers it won't be able to schedule all of them, but if you pick a sane number it will spin them up properly.

All of these objects are configured through JSON or YAML files; it doesn't really matter, Kubernetes accepts both. YAML is a bit easier to read, because the files tend to be rather big, and JSON all smashed onto one line is pretty much unreadable.
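To give an idea of what that looks like, here is a minimal sketch of a deployment manifest along the lines of the nginx example; the names and labels are made up for illustration, and depending on the Kubernetes version the apiVersion may be apps/v1 or one of the older beta API groups:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: www                  # the deployment name; pods get this plus a random suffix
spec:
  replicas: 3                # how many copies of the pod template to keep running
  selector:
    matchLabels:
      app: www
  template:                  # the pod template that gets instantiated 'replicas' times
    metadata:
      labels:
        app: www
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```

Running kubectl create -f on a file like this registers the deployment with the API server, and the scheduler and the kubelets take it from there.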
Oh wait, the slide isn't advancing properly; it's scrolling instead of jumping to the next slide. There we go.

So here's a picture of what a simple cluster might look like. This is a cluster consisting of three nodes, and those three nodes are in total running four pods, instantiated from two different deployments. One of them is called db: a simple deployment that spawns, say, a MySQL container that serves incoming requests and handles SQL queries, and maybe some kind of background scrubbing, fsck-like job. MySQL doesn't actually need this, but if you had a somewhat more complex database system you might have a background job that scans over the data set and prunes data, for example. Then there's a second deployment at the top, spread out across multiple nodes in the cluster. This deployment is called www and contains one container running nginx. If you spin this up it creates multiple pods, and because all pods have to have unique names it adds some garbage at the end, so you see one pod called www-a30d-something: some kind of random hash, it just needs to add something to make the name unique. That's where all the random garbage comes from when you look at the pods running on a Kubernetes cluster.

On all of the nodes in the cluster you have a process running called the kubelet. The kubelet is basically a tool that looks at the state stored in the API server (the API server is sort of a database that keeps track of everything that needs to be running), compares it against what's running on the node itself, and if there are any discrepancies it spins up more containers and reports back whether that was successful or not.

By default the kube-apiserver doesn't have any event loops in it; it's just a static server, and the only thing it can do is keep track of what needs to be running on the cluster. So in order to get jobs spawned properly you need to run two other components. They can be run on the cluster itself, or on a separate server, it doesn't really matter. One of them is the kube-scheduler: it looks at all of the pods registered in the API server, finds the ones that aren't scheduled on a node yet, and then picks a node, using some kind of bin-packing algorithm to determine the best place to run a certain pod. The other one is the kube-controller-manager, which is responsible for all of the miscellaneous event loops and actions that need to happen on the cluster. For example, this is the component that hands out IP ranges on the cluster: it gets the list of nodes from the API server and decides that a specific node needs to make use of a specific IPv4 range.

Here's a simple example of how you would spawn a simple pod on the cluster; not using a deployment, but just a single set of containers that needs to be scheduled on some node. You write a YAML file that declares a pod, not a deployment, and says this pod should consist of these containers: in this case, just a single nginx container.
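As a rough sketch (not the exact manifest from the slide), that pod definition looks something like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: www
spec:
  containers:
  - name: nginx
    image: nginx        # a plain name like this is pulled from Docker Hub
    ports:
    - containerPort: 80
```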
For the image you can specify a very simple name; if you just specify a plain name like that, it will go to Docker Hub and download the image under that name. You can also specify a full URL, and then it points to your own on-premise container registry service.

Another important aspect of Kubernetes is how it does networking. This is again done using some separate tools and concepts. One of them is called services. A service is basically an object that you also register in Kubernetes, and it matches on pods: it says these pods together form some kind of uniform service that needs to receive load-balanced traffic. If you create a service it adds an additional IPv4 address to the cluster, and if you communicate with that IPv4 address you don't end up on a single pod, you end up being load balanced across all of the pods that are part of that service.

To make all of that work there are two separate daemons that you also need to run on your nodes. One of them is kube-proxy. kube-proxy is not really a proxy, but a tool that scrapes the state from the API server about which services are registered and generates a whole bunch of iptables rules to do load balancing across the nodes. It's just a job that runs in the background on your server; you could run it through Kubernetes as well through some hacks, or you could just set it up with an init script on the node itself to run at startup. Then there's another service called kube-dns, which allows you to resolve services by name. In the case of that web server, if I create a service called www, then kube-dns allows you to resolve the hostname www.namespace.cluster.local, or something with a suffix like that, to the RFC 1918 address.

You have a question? So the question is: is there a reason why it's an RFC 1918 address? Yes, you need to use an RFC 1918 address here because the entire idea behind Kubernetes is that the cluster is internal; you shouldn't be literally exposing a cluster with all of its internal addresses to the public internet directly. Why is it using IPv4? Mainly because the Google developers have been too lazy to add IPv6 support. That's the main reason; there's a ticket open on GitHub for adding IPv6 support and it's been open for a couple of years, so they're still using IPv4. It's like asking why bananas are curved: there's little we can do about it, right?

That only covers how cluster-internal network traffic works. A very important question is how external traffic gets into the cluster. They have a separate concept for that which I'm not going to discuss today, called ingress controllers, and this allows you to spawn jobs that take traffic from public IPv4 and IPv6 addresses and route it into the cluster over the internal IPv4 addresses.
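Coming back to services for a moment, a service manifest for that www deployment might look roughly like this; again a sketch, with made-up label names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: www            # resolvable via kube-dns as www.<namespace>.svc.cluster.local
spec:
  selector:
    app: www           # traffic is load balanced across all pods carrying this label
  ports:
  - port: 80           # the service's cluster-internal port
    targetPort: 80     # the port the pods are actually listening on
```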
So what are the weaknesses of Kubernetes? I'll start with those. The networking: IPv4, as I mentioned before, it's a shame it's not using IPv6. Also, the fact that it allocates a whole /24 for every node to hand out addresses to pods means you can run out of addresses quite quickly. Say you run a cluster with 256 nodes: you're already using up a /16 just for running some pods, which is quite aggressive.

Another problem is that, because they make use of NAT, proxying and all that kind of stuff to make load balancing work, it's actually pretty hard to trace traffic properly. Say some front-end web server job is sending a SQL statement over to a database back end and it's not working properly: how do you know which server the traffic was sent to? You only see traffic going from the pod's IP address to the service address of the database system, so how do you know which database back end was used? You see a lot of people adding all sorts of hacks for this; for example, if your back end is a web service, you add some kind of HTTP response header to tell which back end you're talking to, which is a bit sloppy in my opinion.

Something we've noticed at Kumina, where we're making use of Kubernetes in combination with Docker on Linux, is that kube-proxy can actually get stale, and then you sometimes see traffic being misrouted to random other back ends in the cluster, because addresses are reused quite aggressively. One time we were restarting some jobs on the cluster and we saw user traffic ending up on the staging setup, on the wrong job entirely, just because addresses are reused so aggressively that traffic can end up at the wrong jobs.

Initially Kubernetes didn't have any support for network policy, so all of the containers running on the cluster could basically chat with each other; security was pretty bad. They later solved this by adding network policy support, but that can only limit incoming traffic: you can't specify which traffic a container may generate, only which traffic a container may receive. That's somewhat of an improvement, but still not ideal. So if you're thinking of using Kubernetes to run a multi-tenant cluster, I would strongly advise against doing that; it's not very wise.
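To illustrate that ingress-only limitation, here's roughly what a network policy looks like; note that there is only an ingress section, and nothing that restricts what the selected pods themselves may send. The label names are made up, and on older Kubernetes versions the API group may still be a beta one:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-www-only
spec:
  podSelector:
    matchLabels:
      app: db              # the pods this policy protects
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: www         # only the web pods may connect to the database pods
```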
I actually wouldn't you know Think it's secure enough to operate the single cluster where you're running jobs for for multiple customers and Also like the final thing that I think is quite a weakness of Kubernetes or containers in general said it actually Creates a sort of cargo cult programming culture where a lot of the containers need to copy paste a lot of garbage over to make it work You know Docker files that are hundred lines long shell scripts that That are only useful for starting up a binary in the end that are also a hundred lines long Combined with Docker images that are hundreds of megabytes in size containing a whole set of like G-Lipsy core utils only to run a very simple web application. I mean, that's just copy-paste programming Waste a lot of this space adds a lot of security issues, of course It's it should be a lot simpler people should just write a small web application Just write some code and then just press a play button and run it on their cluster They shouldn't be thinking about you know Just running entire Linux distos inside of containers Still there are some things that are pretty good about Kubernetes. It all works quite reliable The automatic rescheduling works pretty well We use that kumina and we we don't get paged in the middle of the night that often anymore You know whenever a system crashes, you know some some disk breaks down or something goes wrong with the like the networking interface Unless some node it just gets disconnected from the rest of the cluster and Kubernetes just spawns a job on some other node in the cluster and everything's all right Cube CTL the tool is really friendly to use You know if you fuse it for a day or so then you basically understand 90% of its functionality, which is pretty good Docker hub is also pretty good in the sense that you have packages for anything you can think of you have Docker I'm just anything you can think of and the project so the Kubernetes project has like a lot of funding and momentum behind it It isn't going to disappear anytime soon So free BSD. Oh Yeah, so the question is is there any way to properly trace what's going on to figure out what's going on when a container has failed Yes, so What's pretty cool is that? Containers aren't like the state of a container isn't being thrown away immediately You can always run Cube CTL describe to run like to describe a container that is already terminated and then you just get sort of a Couple of screens of text giving all sorts of metadata about the container when it was started when it was terminated Why it was terminated and also some log entries related to that so those are not log entries Generated by the program, but log entries generated by Kubernetes in the process of starting and tearing down that specific container In addition to that there's also a logging facility in Kubernetes So you can run Cube CTL logs and then the name of a pod and then you can actually take a look at the pod standard Outstanded error and it also has flags like dash f and dash dash timestamps to prefix timestamps to the output and also follow the output while being generated so At first you when you start using it You might have the feeling that you're sort of losing control and that things might be coming non transparent But in practice, I haven't really run into those issues a lot I mean you get used to it after some time instead of Browsing through far log you now have to use Cube CTL and that's that's all pretty well. It's perfectly manageable Did that answer your question? 
Okay. So now, the first question we should be asking ourselves: could we port Kubernetes to FreeBSD? Well, we likely could. Some people are already working on getting Docker running on FreeBSD; I heard from some people that it's a bit of a hack job, but others will probably step up to clean it up and it might work after some time. We could also make the Linux compatibility layer in the kernel more complete, so that jobs running in a container have less of an idea that they're running on some broken version of Linux and actually think they're running on a real Linux. We could maybe even adopt some more Linux-specific frameworks like cgroups to make resource limiting and network policies work. We could also extend Kubernetes to support pf instead of iptables for the networking. So there's a whole bunch of things we could do to make Kubernetes work on FreeBSD.

But, next slide: should we port Kubernetes to FreeBSD? Consider this discussion. Somebody from the Linux world says: why bother? I'm going to start up Linux-based containers anyway. Your answer would be: yes, but at least our cluster is based on FreeBSD, which is awesome, it's BSD tech, BSD licensed, yada yada. Then the other person asks: what's the advantage of that? And you say: well, in practice, not much. Are there any disadvantages? Well, yeah, the jobs crash every couple of hours, this doesn't work, that popular Node.js container doesn't work, and no, you can't run Rust programs because we haven't implemented that yet. We could go in that direction, but I think the end result is that this would only make BSD look bad, and it's also uncreative: we'd only be trying to catch up with Linux instead of doing something awesome.

What I think we should do instead is simply accept that people want to use Linux servers to run Linux containers. Don't even try to compete with that; don't get into these arguments that don't make any sense, where it's just us trying to catch up. It's a cat-and-mouse game and we'll always lose it. Instead, we should see if we can at least integrate with Kubernetes in a certain way, and by that I mean seeing if there is a place for FreeBSD nodes inside a Kubernetes cluster. See if we can come up with Kubernetes nodes that actually provide some additional value over plain Docker containers. Focus on niche markets instead of focusing on the 90% of users; focus on the 10% of companies that actually do software development in-house, instead of just firing up vanilla nginx containers, and try to appeal to those kinds of people.

In the process we should try to tackle the weaknesses that Kubernetes on Linux has, and by that I mean trying to improve security and trying to make things more minimal. Development headcount on FreeBSD will always remain smaller than on Linux, at least for the foreseeable future, so we should try to keep things simple and not add a lot of garbage that people don't use in the end. Yes, exactly: KISS, that's basically what I want.

So now I'm going to talk about CloudABI, and later on discuss where I think CloudABI fits into this picture.
This is sort of a recap of the CloudABI talks that I've given in the past. CloudABI is a heavily stripped down, POSIX-like programming environment, and the entire goal behind it is to make programs behave like black boxes. Programs can't just open arbitrary paths on disk, and they can't create arbitrary network connections to the outside world; they're really just black boxes that need to be plugged in before they can be started. That's the entire idea behind it. These dependencies on the outside world are expressed as file descriptors. For example, if you want a program to communicate with the network, then you must make sure you start it up with a socket injected into it. If you want the program to access parts of the file system, then you need to inject file descriptors of directories into it. All in all, you can think of it as a Capsicum-like programming environment; the only difference is that this Capsicum is essentially always turned on.

This model has a couple of advantages. First of all, one thing that's not on the slide but is pretty awesome: it makes it easier to port software, because all of the features that are incompatible with Capsicum have been stripped out. You can just compile your software against CloudABI; with 90% probability it won't build, but then you at least have an inventory of all the things that need to be patched up to make it work properly. Whereas with Capsicum everything builds out of the box, but as soon as you start it, it doesn't work, because the sandboxing you're trying to apply is far too strong. So these programs can be really tightly sandboxed. They can also easily be tested, because the nice thing is that they don't try to open arbitrary paths on disk; you really have to feed them the files and directories they may use. If you want to instantiate CloudABI programs multiple times, that's easy: just inject different resources, and you can be quite certain those instances won't conflict in any way. And because all of the dependencies are known up front, you can deploy these programs a lot more easily, because you have an inventory of what kinds of things they use.

On the next slide there's an example, a simple C++ web server built on top of CloudABI. Some of the includes have been left out and some details are missing, but it consists of two parts. First of all, instead of trying to open network sockets and directories directly, it gets them out of a structure called argdata. The entry point no longer takes string command line arguments; it gets a YAML-like tree of attributes that is passed into the program. The neat thing is that file descriptors can be attached to this tree, so it's not just plain scalars, strings, et cetera: file descriptors can also be added as attributes and pushed into the program.
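To give a feel for what gets pushed in, here's roughly what a configuration for such a program could look like in cloudabi-run's YAML form. This is a sketch from memory: the tag namespace and key names are my assumptions, not the exact file from the slides:

```yaml
%TAG ! tag:nuxi.nl,2015:cloudabi/
---
# The keys below become attributes in the program's argdata tree.
http_socket: !socket        # replaced by a file descriptor of a bound socket
  bind: 0.0.0.0:80
root_directory: !file       # replaced by a file descriptor for this directory
  path: /var/www
```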
It's also file descriptors can be added as attributes pushed into the program So this program loops over all of the attributes that are received and extracts a htp socket and a root directory So once all of that stuff is finished that can go to a second loop where it starts to process incoming network connections using accept Here could be some code to parse htp get requests And in the end maybe some file needs to be served back to the user So it uses open app to open a specific file on disk So in this this slide sort of explains how those programs can be started up First of all, we need to run them through a separate cross compiler that you can stall us from freebies deports And once we have Um, it compiled we need to load a certain uh kernel module called cloud avi 64 And this adds support for running 64 bit cloud avi processes So of course, there's also kernel module called cloud avi 32 that allows you to run the 32 bit processes This configuration then explains how the program needs to be started up This is just some some yaml way of referring to a certain namespace a certain tag namespace as it's called But this is like the most interesting part what we're saying here is htp socket needs to be a socket That's bound to port 80 and this will sort of be replaced by a file descriptor And also for the root directory the root directory needs to be like replaced by a file descriptor And this is sort of all pass them to the to the program when we're running cloud avi run And passing in the yaml file So is is this clear to people you're sitting in the audience? So what are the changes in cloud avi since 2015? Um, the avi is now formally specified. We have like a cloud avi dot txc 2000 lines long describing all of the system calls and data types and this allows people to reuse cloud avi on different operating systems so We automatically generate c header files from it. We could even generate findings for different programming languages. So for example rust et cetera Support for more hardware architectures when I first announced cloud avi. It only worked on x86 64 Nowadays it runs on four architectures Freebies the 11 has been released in the meantime. So support for cloud avi sort of integrated into it Just install freebies the 11 point zero or point one and then you have some proper support for running cloud avi software Um also pretty awesome. We can also emulate it in user space nowadays So even if your operating system doesn't provide native support You can at least run like an unsecure version of cloud avi in user space at least test Test around with it And also more software has been ported in the meantime. So if you want to know more details about this freebc journal Um From may 2017 has a pretty good article on it. What has sort of changed over time So now sort of with the the introduction ahead Now we're actually going to look at the the interesting part like the huge long build up for all of this So now we're going to take kubernetes and replace linux with freebsd and docker with cloud avi How did I do it? 
When we at Kumina started using Kubernetes, we were still on Kubernetes 1.3, a pretty long time ago, and back then Kubernetes was still pretty simple. If you looked at how the kubelet daemon was implemented: for every pod that needed to be created, it invoked Linux system calls to create cgroups, set up the networking, et cetera, and then, to create the containers, it called into the Docker daemon. The Docker daemon would download the Docker image specified in the pod's YAML file and, once downloaded, spawn the containers inside the pods that Kubernetes had set up.

There were two things wrong, or at least annoying, about this model. First of all, it was strongly tied to cgroups: the kubelet contained Linux-specific code to create cgroups and set up the networking, and the Kubernetes people weren't happy with that. Second, it was strongly dependent on the Docker daemon, and within the Kubernetes community there was also discussion about using different container formats: rkt, OCI containers, et cetera. So what they did in Kubernetes 1.5 is introduce an API called the Container Runtime Interface, CRI for short, and split things up into multiple daemons. The kubelet no longer has any understanding of container formats, and no understanding of cgroups either; it's just a Go program that connects to the API server, looks at what it needs to be doing, and forwards those requests as RPCs to two separate services. One of them is the image service: the kubelet sends it RPCs to download Docker images, and maybe remove them again; it also keeps an eye on the free disk space on the node, and if that gets too low it sends RPCs to the image service to throw away large container images that haven't been used for a long time. The other is the runtime service, which implements the old logic of creating those cgroups and sends RPCs over to rkt, OCI runtimes, Docker, et cetera, to spawn the containers inside.

What I decided to do is make use of this Container Runtime Interface to add support for CloudABI. So I've written a couple of daemons, collectively called Scuba; Scuba means "secure Kubernetes". It consists of two separate processes, namely a runtime service and an image service, themselves written as CloudABI processes.
So these services already run as CloudABI processes on a FreeBSD box. The image service is quite simple: instead of downloading a fully fledged Docker image, the goal behind it is to download an ELF file, a CloudABI executable, and store it in some directory on disk. Then there's the runtime service, and basically the only thing it does is fork and run those ELF files.

I've added an extension to Kubernetes so that, instead of using command line arguments, environment variables and that kind of stuff, the argdata YAML specification can be placed directly inside the Kubernetes pod specification. Normally, when starting a CloudABI application, you would use tags like socket and file to access the network and the disk. With Scuba those tags are gone; instead you get tags that are specific to Kubernetes, to the environment at hand. For example, there's a tag called kubernetes/container-log: if you place it in the YAML file, the process gets a pipe it can write to, and everything written into it ends up in the container log. For networking I've added tags like kubernetes/server and kubernetes/client, with which you can say this is a program that needs to be started with a network service; these hand sockets over to the program on which it can do its networking. Then there's another tag called kubernetes/mount, with which you can refer to disks or paths on the system. Kubernetes has built-in support for doing NFS mounts or attaching Amazon EBS volumes, for example, and all of that is supported: you can hand those mount points over to the CloudABI process using kubernetes/mount.

So this is a picture of what my setup looks like. Instead of Docker, it's just the kubelet talking to the Scuba image service and the Scuba runtime service. The Scuba image service downloads an image from the internet and places it in a certain directory on disk, and the Scuba runtime service takes the ELF from that location and spawns the jobs.
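To illustrate the tags just described, here's a purely hypothetical sketch of what such a pod specification might look like. The talk only names the tags, so the surrounding field names and structure here are my own invention, not Scuba's actual schema:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webserver
spec:
  containers:
  - name: webserver
    image: https://example.com/webserver    # fetched by the Scuba image service as a CloudABI ELF executable
    # Hypothetical: the argdata tree embedded in the pod spec, using the
    # Kubernetes-specific tags described above instead of plain socket/file tags.
    argdata:
      console: !kubernetes/container-log {}        # a pipe; writes end up in the container log
      http_socket: !kubernetes/server {port: 80}   # a socket handed to the program
      data: !kubernetes/mount {volume: data}       # a directory descriptor for a Kubernetes volume
```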
I got this working initially. I got some very simple jobs to run: a very simple HTTP server, and a sleep executable that would just stay in a loop and write some entries into the log file, nothing more. But I started to realize after some time that the way Kubernetes does its networking for pods is fairly suboptimal, and it boils down to some of the slides I showed earlier in this talk: IPv4 is easy to exhaust, et cetera.

The problem with APIs like bind and connect, the traditional APIs for binding to a certain port number or connecting to some other host on the network, is that they require a lot of additional security frameworks to be secure: you need iptables or pf to generate hundreds, maybe thousands of lines of rules on an average node to end up with a secure policy. Another problem with these functions is that they also require extra kernel frameworks to do tracing and debugging. The fact that I always had to log in on the Linux servers running the Docker containers in our Kubernetes cluster to run tcpdump is fairly annoying, because you get so much garbage in there; there's no easy way to say "I want to capture only the traffic for this particular container", just using that container's name. There's also no support for metadata passing: all of these APIs are IPv4-address based, so whenever a container receives an incoming connection from some other pod on the cluster, it only gets an IPv4 address, and it somehow needs to do a reverse lookup to figure out which container tried to contact it.

So what I've done is write a daemon called Flower, which I jokingly sometimes call "sockets as a service", SaaS; sometimes I also call it a dating service for Unix apps. It's nothing more than a daemon to which you can send RPCs to register yourself as a service, and to which you can send other RPCs to connect to the services that have been registered. Whenever there's a match — you've got one server registered and one client trying to connect to it — Flower creates a unique socket pair and hands out one file descriptor to each end; it uses file descriptor passing. This project is actually unrelated to Kubernetes and CloudABI. In practice you'd likely want to use them in combination, but in theory you can use it separately.

So how does it work? First you start a process called the switchboard; it's just like a traditional patch panel. Then you can run these other commands, like flower cat, which is a bit like netcat, to listen on the switchboard or to connect to a process listening on the switchboard, and you use labels to identify the different processes listening on the switchboard. That's all a bit lame and doesn't really look exciting, so here's a more practical example of how you could use it. Again, you start the switchboard, but instead of flower cat you use cloudabi-run to spawn a CloudABI process; there's a simple demo web server nowadays that's already prepackaged. Inside the YAML, which I haven't put on this slide, you refer to the switchboard listening on /tmp/flower and you say: I want to run this process listening on the switchboard. Then there's another process, flower ingress accept: it binds to a TCP port number and calls accept() in a loop to take incoming connections, and it pushes the resulting file descriptors into the Flower switchboard, which can then hand them over to a process on the other side. Once you run all three of these commands, you can finally run curl localhost and get a simple response from the web server.

So far this doesn't really look that exciting, but here is where the actual value of this system comes in: the matching that's done on the labels. It's not an exact match; it doesn't require that both sides, the clients and the servers, present the same set of labels. A valid match is constructed when there are no contradicting labels, and this allows both parties to attach more labels than are strictly necessary for the matching. Clients can provide extra metadata specifying who they are, and servers can also add extra metadata, saying for example: this is a web server running nginx.
Or: this is a web server running version such-and-such. Also, in the case of the ingress, it will attach the IP address and port number of the peer as extra labels before forwarding the connection over to Flower, so the process on the other side knows the IP address of the connecting client.

It's also capability-based: the handles pointing to the switchboard can be duplicated and constrained, and extra labels can be attached to them. So you can actually enforce that a program using the switchboard always has to identify itself; that it always has to provide a label with every request saying "my pod name is such-and-such".

So this is the resulting picture. The Scuba runtime now has a connection to the switchboard, and for every container in a pod it starts up that has a kubernetes/client or kubernetes/server tag, it creates additional handles to the switchboard; those are these lines over here. Whenever nginx and MySQL want to communicate with each other, nginx sends a request to the switchboard, which creates a socket pair and hands out the two ends to MySQL and nginx; that's this line over here.

So the question is: does all of this actually work? Well, let's see. Let me try to do this while holding the microphone. I hope everyone can read this: on the left there's a FreeBSD server. Is the font size okay? Oh, okay, well, then I'll try to hold it like this. Now I need to type in my sudo password. What I've done now: in the right terminal I've started up a Kubernetes server — or sorry, an API server — and on the left I've started a kubelet. We see that the kubelet — this is a FreeBSD VM — has registered itself as kubelet3, et cetera. And there on the right you see that Kubernetes is crashing, so that's not a good sign, but well, it does that sometimes. At least what I can do now is run kubectl get nodes, and you see that there's now one node in the cluster called kubelet3. This is a FreeBSD node; I could also describe it and you'd actually see that it's a FreeBSD node. But now when I run kubectl create -f manifests/webserver.yaml, nothing is happening over there. Oh man, this is annoying. So what I'll do is first finish the remainder of my slides and then come back to the demo. The annoying thing is that I'm using a Kubernetes master where the API server sometimes crashes, unrelated to any of the CloudABI things, so yeah, that's a shame. We'll come back to that in a minute.

So, wrapping up. Well, if it had worked: there is something called Kubernetes, and I think we in FreeBSD should be using it as well, as long as we're not just trying to catch up with Linux. We already have some components in FreeBSD, or readily available for it, that allow us to do that. One of them is CloudABI for easy sandboxing; with Scuba we can run those programs on a cluster, and with Flower we can let them communicate over the network.

What's my wish list for 2018, or the remainder of 2017? So far this has been a solo effort. There are a number of people on IRC who also hack on CloudABI a bit, but the entire Kubernetes part is just me on my own, and I can only work on this part time.
I also have bills to pay, of course, which is why I do some consulting work on the side. It would be really awesome if I could somehow do this full time, but at the same time I'm also really hoping for participation from the community. It would be awesome if other people could use and test this, hack on it and port stuff, help me document it, promote it at other conferences, et cetera, and eventually also fund it, so that it can become something sustainable. Most of this work right now only runs on FreeBSD, but in theory this could also be done for the other BSDs, so even if you're not a FreeBSD fan but you're still interested in CloudABI, please get in touch; we can always get this to work on the other BSDs as well.

Here's a bunch of links to GitHub repositories: the patched-up Kubernetes source, the CloudABI definitions, Scuba, Flower. A lot of these things are already in FreeBSD Ports, and if they're not, the main reason is that they are CloudABI binaries, and those are packaged in a separate repository that you can just add to pkg through a configuration file.

That's actually all of my slides, so I'm now going to look at Olivier: is there still time for the demo? Shall we do the demo? Okay. What I now need to do is actually not all that hard: I just need to shut down all of Kubernetes and throw away all of its state. I'm going to do that now with two hands, because I need to type.

No, no, this is just a testing cluster consisting of two nodes. The question was whether the API server also always crashes in production, and the answer to that is no. The reason this one crashes is that it's just a git master checkout from some point in time that I had to patch up to make it work on FreeBSD; there's a small number of changes to add the FreeBSD bits, plus some CloudABI changes. That said, this particular version of the API server crashes with something like a 20 or 30 percent chance, so I just picked an unlucky revision; the eventual version shouldn't be crashing all day long.

So now I've restarted Kubernetes on the master. The right terminal is a Linux system running the Kubernetes master, and the left one is the FreeBSD system running the kubelet. You can see that it has successfully registered again, and this time we don't see that ugly backtrace in the terminal, which is good. On the Linux server I can run kubectl get nodes, and indeed we see that kubelet3 has registered itself in the cluster. Then kubectl create -f, and we now see some output here in the left terminal; that's because it has started a couple of jobs. If I run kubectl get pods you can see that it has spawned three web server processes on this node in the cluster. If I were to describe them I'd get more info: you can see on which node in the cluster each one is running, when it was started, how many resources it is using, et cetera. One thing I can show you, for example, is how to increase the number of jobs.
So if I run kubectl edit deployment web server I just get spawned in vim And I could head over to this line indicating the number of replicas running on the cluster So I would just say like let's change this to five and exit vim Then voila you see in the left terminal that it generated some more output in the meantime, but now we have five nodes running on the cluster So one thing that I could now do is like for the sake of this demo expose one over the network And you can actually see what this web service that I spawned on the cluster So I'm now going to run sudo flower Or I'll just type it in first and then explain to you what it does So this command over here where I say I'm now going to start an ingress That's going to listen on port 80 and send all traffic over to one specific pod in the cluster Namely server kubernetes pop name is web server, etc If I'm now going to start this Then this web server will be listening on port 80 Which has now worked it has spawned my web server. That's like serving some kind of Silly html page on the uh to the browser that I found on the events quite impressive There's one kilowatt of JavaScript and it's printing itself So right now you see that like the networking part through flower is sort of is sort of working Um The only thing that's sort of missing right now is load balancing support This is still something that needs to be added to flower So you could see when I started the ingress that I directed all traffic to one specific instance of the pod This of course needs to be extended that you can provide the name of a service to redirect all traffic for a service to all of its back end So this is still on the to-do So this is just a really tiny tiny demo to show you that like in fact This does work this does spawn jobs in a cluster and these are all sandbox all these web servers over here Please hack them. There might be security bugs in them The fortunate thing about it is these jobs all started up as such a way that they can only communicate with the gubernese log and The switchboard so the impact of that would be fairly minimal Are there any questions? Yeah, thanks for the talk. Um You said that the python was uh Ported to support cloud lbi right does the python program Need to support it as well That's a really good question. Um, so we have a version of python that works on cloud lbi The there is a python executable, but it doesn't have like the same startup Process as the traditional python one. So it's not just python space file name What you do is you write a yaml file In which you specify which include paths that python is allowed to use And you specify the file name of the python script that needs to be run And maybe some other resources on which your python script depends and then you can use cloud lbi dash run On the python interpreter specify of you know giving it the config file that uh lists all the resources that it's allowed to use Does it answer your question? What about the like network sockets and file descriptors open files and stuff like that So you're not allowed to call just plain open inside of python But all of the resources like directories in which you depend and network sockets They can be passed as arguments onto the python script And what happens is that inside of python? There's no sys dot arc b anymore. 
What about network sockets and file descriptors, open files and stuff like that? So, you're not allowed to just call plain open() inside Python, but all of the resources you depend on, like directories and network sockets, can be passed as arguments to the Python script. What happens is that inside Python there's no sys.argv anymore, and no sys.environ, but there is a sys.argdata that gives you access to all of the attributes you passed in through the YAML file. And I can always use a maintainer, or more people who want to hack on our Python port. At one point I was looking into porting Django; that stalled about halfway, because I got distracted by all of this Kubernetes work, but we could always use more people looking into it. Thanks.

Hi, I have two questions, if you permit me. First: while you have to pass file descriptors in advance, you can pass a descriptor to a directory in CloudABI and then use openat() beneath that directory to subsequently open arbitrary files, so we can run interpreted languages like PHP-FPM and have them open arbitrary scripts? Yeah. So if you want a web server that spawns a separate PHP process to run a script, you would also need to pass in a file descriptor for the executable of the PHP interpreter, but passing in directories means you can access any files within them and derive new file descriptors from those. So you're not limited to the file descriptors passed in at startup; it's only that those limit what you can reach from them. Any directory underneath a directory you were given can get its own separate directory descriptor that you open later on.

Great. And why would you want to put the load balancing into Flower instead of using something like HAProxy? Well, that's a really good question. One of the things I also want to do — this is somewhat related to load balancing — is add support for DNS lookups, not inside Flower itself, but as a separate process. One of the things I didn't show in my slides is that in addition to an ingress there's also an egress, which allows you to make outgoing connections, so CloudABI processes can connect to jobs on the internet. For that we also need DNS support, of course, because only connecting by IP address is just plain ugly and doesn't make any sense. And that is also load balancing in a certain way: you're resolving a certain hostname, say google.com, it may return multiple IP addresses, and then Flower needs to look at all of those results and pick the first one, or use some kind of logic to make a successful connection. So I think load balancing and DNS support are fairly strongly related to each other, and that's why we might see this being solved in Flower. But this is still on the drawing board; if you want to help out, have a chat about it and give your input, just ping me.

Any other questions? Could you go back to that one slide for a moment? I want to take a picture. Sorry, which slide? Oh, this one? Yeah, sure, go ahead. There's another question back there: are you going to upload the slides somewhere, so I can show the Linux people what crazy things you're doing and what might actually be interesting? Yeah, my plan is, whenever conferences have some kind of facility for uploading slides, I will; I'll probably send them over to the programme committee and then they'll appear on the website. Any other questions? Why do bananas happen?
There is one more. You mentioned the lack of traceability of applications in the traditional Kubernetes setup; is there any support for tracing, like a simplified tcpdump or some similar command, for doing this with Flower? Well, in the traditional setup there might be some third-party automation around it, but Kubernetes doesn't really provide something like that officially, and I think one of the main reasons is that instead of focusing on TCP packets they are looking at the bigger picture: they're more interested in tracing requests, tracing RPCs across a larger cluster — an RPC gets sent from A to B, gets forwarded from B to C. There is an open source tracing framework that's also part of the Cloud Native Computing Foundation, but it's in no way focused on tcpdump-style capture.

So right now, what kind of strategy do you have for the load balancing part? Because now that you control all of this, you could implement whatever you want; do you know any good strategies? If so, help me out. I mean, you could implement one approach and, based on the experience you get, try something else, and use it as a playground for comparing the differences. So one thing I did think about is that the announcing of the targets you can connect to could be placed in a separate process that hands that info to the switchboard, and in addition to just giving the names of the things you can connect to, we could also add weight scores. If you wanted a more complex load balancing scheme, you could have some dynamic weight computation, so that if one back end discovers it's being overloaded, it reduces its weight, causing less traffic to be sent its way. Something like directing traffic to a specific point of presence by zone or geolocation? Yeah, I have to confess I haven't given the whole load balancing question that much thought yet, but this is all open for discussion.

Okay, referring to the last question: did I get you correctly that Flower load balancing and Flower network switching are all about a single node, so it's local load balancing of pods and services? So far it is. So it wouldn't help in your global load balancing example; you would still have to involve different boxes across, let's say, different data centers. For example, there are companies that try to implement this kind of thing using the Docker API in order to have a multi-tenant, globally distributed cluster. So, I think in general you could argue that it's all the same problem; it doesn't matter whether it's within a data center or across data centers. But I think our focus should first be on in-data-center load balancing, because it's unwise to smear one Kubernetes cluster across multiple data centers anyway: the API server will still remain a single point of failure, and things slow down if you actually try to spread it out a lot. It's often a lot wiser to have one Kubernetes cluster per data center, and therefore I think that Flower should for now mainly look at in-data-center load balancing. Cross-data-center load balancing also involves BGP and all sorts of other network logistics problems that are out of scope for Flower. Did that answer your question? Okay.

Okay, so thank you, and now it's time for lunch.