Well, thank you for coming. My name is Bryan Boreham; I work for a company called Weaveworks. Let me just check who's in the room: who had heard of Weave Net before they read the schedule? About half the room — that's okay. And who identifies as, like, a kernel developer, or a device driver or DPDK developer? They've all left. Okay, that's great — because, as I put on this slide, I am not a networking expert. I'm a programmer, and I've been looking after this project for five years. It's been downloaded 250 million times; it's been starred 5,000 times. There are certain things to be proud of, but fundamentally I don't think I did anything clever, so I'm kind of glad that all the clever people have left the room.

I put my smiling face up somewhat because I'm going to put a bunch of other people's faces up too. This is a talk somewhat about the technology and somewhat about the people and the history of this project, so hopefully that's interesting.

So, what is Weave Net? It's a container network — and I'll talk about what that means in a minute. The primary thing we were aiming for is that it's easy to install: it just works, and it runs anywhere. There's a little asterisk, because we mean anywhere that is Linux — but Windows, like, runs Linux now. We have tried this: you can actually run Weave Net on WSL. So, nearly everywhere. And it's open source, Apache licensed; we never made an enterprise version. So enjoy.

Yeah, what is a container network? Well, we had one definition from Justin Garrison, who works for Disney — I'm an admirer of his work, and generally the work of Disney, it's good stuff — but he said there's no such thing as container networking. So that was a bummer, because I'd been working on it for five years. But actually it turns out he uses Weave Net, and it just works.

So, more seriously: what is a container network? The point of containers — or one point, at least — is isolation. Through namespaces and kernel features, one thing sort of believes it's completely separate from another thing, including the network: they have separate network namespaces. So now, how do these things talk to each other? Well, whatever the answer to that question is, that's a container network. That's the definition I'm going to work with.

Okay, let's go here. Conceptually, what does this mean? I'm going to put up a large number of diagrams that look a little bit like this. The meaning of the shapes: the big darker blue shape is a machine — whether it's bare metal or a VM or something like that, that's your node in the network — and the light blue blobs are containers. Sometimes containers are talking to each other on the same machine, sometimes they're talking to each other on different machines, and by and large we're going to have lots of these things, all talking at once, in different amounts. So that's the theoretical high-level model: that's what we want to do.

Let's go back five years — five and a half years. This smiling face, Mr. Sackman, wrote the first version of what became Weave Net. He came out of RabbitMQ — in fact the founders of Weaveworks all came from RabbitMQ. So here's an Erlang programmer.
The code is all written in Go, but a lot of it has quite a strong Erlang flavor to it, which is kind of cool, if you want to think about that. The first commit was 3,400 lines, and — spoiler — we're at like 30,000 lines now, so it's grown a bit.

Anyway, fundamentally what we do — in that version — is: we put a bridge, a Linux bridge, on each machine; we connect all the containers on one machine to the bridge; and then we tunnel the packets from one bridge to another bridge. That's the conceptual model taken to a natural implementation. Let's take it down a layer further. Specifically, for each container we set up a veth — a virtual ethernet device. One end is inside the container's namespace, the other end is attached to the bridge in the host, and we listen on that bridge using pcap.

The room went quiet. No, seriously: we tested three different ways to do it, like tap devices and whatever, and as things stood five years ago in Go, pcap worked best. So there it is.

If you've got two containers on the same host, they're both attached to the same bridge; they'll just talk to each other over the bridge, and Weave Net doesn't get involved. If you have two containers on different machines, then we're going to pick up those packets, put them inside another UDP packet — it's a kind of homegrown encapsulation — send that over the network, and deliver it to the bridge on the other side, again via pcap packet injection. And that's going to end up at its destination.

This is — I like to call it, at least — a distributed ethernet switch. Weave Net implements an ethernet switch. It's a layer 2 network; it works purely in terms of MAC addresses. It does what a dumb ethernet switch does: it learns MAC addresses by seeing a packet come in and observing the source address it came from, and when it later has a packet to send to that destination, it uses what it learned and delivers the packet that way. A physical ethernet switch is going to deliver on different cables; we, in software, are going to send to different hosts. It's exactly the same concept. We also have the same fallback behavior: if we don't know where a packet is supposed to go, send it everywhere. That behavior will come in useful later.

This is a real website — somebody was kind enough to make this website; I didn't put their smiling face up. It's pointing out something that we did actually know. If I step back — can I step back? Yeah, okay — the reason it's kind of slow is this: we start off in user space, in the program that's trying to get some work done. We go down into the kernel; we come back up into user space through pcap; we put the packet in a UDP packet; we go down into the kernel again, across the physical network — and then the same thing happens again on the other side. We go up-down, up-down, up-down, and yeah, it's kind of slow. We used to measure it, five years ago, at 300 microseconds of extra latency per packet. Now, you have to set this against what you're actually going to do with those packets: if the next thing that happens is they get delivered to a massive heap of PHP code, then 300 microseconds is not your problem. But yeah, it's kind of slow.
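To make that up-down journey concrete, here's a toy sketch in Go of the sleeve-style slow path — my illustration, not Weave Net's actual code. The bridge name, the peer address, and the bare-frame-in-UDP framing are all assumptions:

```go
// Toy sketch of the "sleeve"-style slow path: capture frames from the
// bridge in user space, wrap each one in a UDP packet to a peer.
// Not Weave Net's real code -- names and framing are illustrative.
package main

import (
	"log"
	"net"

	"github.com/google/gopacket/pcap"
)

func main() {
	// Capture raw ethernet frames off the local bridge (kernel -> user space).
	handle, err := pcap.OpenLive("weave", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// UDP socket to a peer: conceptually, the encapsulation is just the
	// captured frame as a UDP payload (user space -> kernel again).
	peer, err := net.Dial("udp", "192.168.0.2:6783") // 6783 is the port Weave Net uses
	if err != nil {
		log.Fatal(err)
	}

	for {
		frame, _, err := handle.ReadPacketData()
		if err != nil {
			continue
		}
		// The real thing consults the learned MAC table here to pick a
		// peer, and floods to all peers if the destination is unknown.
		if _, err := peer.Write(frame); err != nil {
			log.Print(err)
		}
	}
	// The receiving side is the mirror image: read the UDP payload and
	// inject it into the bridge with handle.WritePacketData(frame).
}
```

Count the crossings between user space and the kernel in that loop and you can see where the 300 microseconds go.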
Okay, so the next step in the evolution: we implemented what we call the fast datapath — because we have no imagination when thinking of names. It's a similar picture. The packet starts off in a container; again we've attached a veth, but the other end of the veth is now in a different device, which is an Open vSwitch datapath. This is implemented by a kernel module from the Open vSwitch project — and it's the only piece of the Open vSwitch project that we use. Our daemon processes are basically implementing our control plane independently of Open vSwitch, but we're using their kernel module. It takes the place of the bridge, at least in this version of the code, and we add a few bridge-like behaviors to it to get everything we need out of it. Once a source/destination MAC pair has been seen talking on the container network, we set up a VXLAN tunnel, and that goes kernel to kernel. The packets don't do this up-down, up-down thing; they are encapsulated, which costs you a little bit, but we used to measure this on a 10-gigabit network — which we thought was fast in 2015 — at 8 gigabits of throughput. So it wasn't that bad. It's doing encapsulation, but it's kernel to kernel, and the packet is delivered to its destination pretty fast.

The person who did this was Mr. Wragg — Dr. Wragg, I should say; almost everybody who worked on Weave Net has a PhD, except me. Sorry. Like I say, I like to put up the smiling faces. So that's the fast datapath. It fixed the main obstacle we had in the marketplace, which was that it was kind of slow.

Let's talk a bit about how we set all these things up. Right from the very beginning, we need a bridge, we need veths, we need to step into network namespaces and step out again, we need to set up some iptables rules, we need to set some sysctls — a bunch of things we need to do. So how do we do all that? In a shell script. We borrowed liberally from a project called pipework, by Jérôme Petazzoni, who was at Docker at the time, I think. That project is a shell script, and it turns out it's actually really concise to do that kind of `ip netns` and-so-on stuff that way. So we had our own shell script, which is called weave, and it started off at 350 lines, with commands like `weave launch` and `weave attach` and so on. At peak it got up to two and a half thousand lines. It's not a very nice place to be, maintaining a two-and-a-half-thousand-line shell script, so I sat down and re-implemented a lot of it in Go. It's currently — as of the latest commit — at 1,600 lines or so. The feature set keeps getting bigger; it works in all kinds of different modes and so on, and that bloats it. But one thing about recoding these things from shell script into Go: the code gets like 50 times bigger, because Go is notoriously verbose. But there we are.
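To give a flavor of the plumbing involved, here's a minimal sketch of the veth-and-bridge setup, written in Go but shelling out to iproute2 the way the weave script does. The device names, the named netns, and the address are illustrative assumptions — the real script handles many more cases, and the Go rewrite talks netlink directly:

```go
// Sketch: attach a container's network namespace to a Linux bridge with a
// veth pair, by shelling out to iproute2. Illustrative only.
package main

import (
	"log"
	"os/exec"
)

func run(args ...string) {
	if out, err := exec.Command("ip", args...).CombinedOutput(); err != nil {
		log.Fatalf("ip %v: %v: %s", args, err, out)
	}
}

func attach(netnsName, containerIP string) {
	// Create the veth pair; one end stays in the host namespace.
	run("link", "add", "vethwl0", "type", "veth", "peer", "name", "vethwg0")
	run("link", "set", "vethwl0", "master", "weave") // host end onto the bridge
	run("link", "set", "vethwl0", "up")
	// Move the other end into the container's namespace and configure it.
	run("link", "set", "vethwg0", "netns", netnsName)
	run("netns", "exec", netnsName, "ip", "addr", "add", containerIP, "dev", "vethwg0")
	run("netns", "exec", netnsName, "ip", "link", "set", "vethwg0", "up")
}

func main() { attach("container1", "10.32.0.5/12") }
```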
So what else do we do? Encryption. We do that both ways — with the fast datapath and with the slow datapath. We renamed the slow datapath "sleeve", because "slow datapath" didn't seem like a good branding position, corporately. You know, a sleeve is a thing that encapsulates something — cunning metaphor. In user space we use the NaCl library — NaCl, sodium chloride, salt — to do our encryption. In kernel space we use the XFRM framework, and there's a wonderful explanation at the link at the bottom of all the minute details of how we do this.

One interesting tweak: we couldn't get this to work at all for months, essentially because the Open vSwitch datapath doesn't provide any way to drive the packets through the XFRM framework — we can't set a policy saying "everything on this datapath, go through here". Eventually the idea of how to fix this we stole from Docker: we put all the packets through an iptables rule which marks them, and then set a policy on that mark. So we have an iptables rule whose only function is to glue together two bits of software inside the kernel that otherwise don't play together. And that's a lot of the history of this project: fighting with things that didn't quite want to do what we wanted them to do. The history is there in the code, and some of it I can remember.

Anyway, so we encrypt the packets, we do the key management up here, and we did not roll our own crypto. Some people like that feature. It's encrypted on this side; it's encrypted when it hits the underlying network; it's not encrypted here. So if you've managed to get onto the machine, you can sniff this veth and you'll see the plain traffic. But I always reckon that if you've got that much access — to be on the machine, sniffing a veth — then you've probably lost the game already. Who knows.

What else? Oh yeah, Martynas wrote this. Martynas did all the gluing-things-together at the XFRM level; he now works on Cilium at Isovalent, which is the Cilium vendor. So that's Martynas.

Okay, change tack again. Weave Net is a peer-to-peer network. The title of this talk is "no central point of control", and it's a pun on the management style and the technology. We wanted it to be just install-and-run, whether you're running on your laptop, or in the cloud, or on a hundred hosts, or whatever. What most people did to put together a container network was rely on something like etcd to be a central, consistent store of what's going on — to hold all the container information, or the routes, or whatever. And we didn't do that. Weave Net is completely peer-to-peer: you can start with one peer and then keep adding more peers, and they talk to each other via gossip. I've given each one a little flag. Each peer has an identity on the network, and that peer can be present on the network or it can go away — you can close your laptop, take it on a plane, and open it up again, and it'll still work on the network. The way we do that is that all the shared data structures are implemented as CRDTs — as eventual-consistency data structures. They're specially designed so that somebody can be absent for any number of hours and come back again, and the data reconciles: it all fits together. That is incredibly hard work. So it has this property that you don't have to set up etcd before you get started — but it is very hard work.

We do this for IP address management as well — we do it for several things, but that's one of them. We basically take an IP address space and map it onto a ring, like a distributed-hash-table-type ring, spread that across the network, and gossip updates to that ring. That's how we do IP address management.
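To make the ring idea concrete, here's a toy Go sketch of a gossiped address ring — it captures the flavor of the CRDT merge, but the fields and the merge rule are simplified assumptions, not Weave Net's real IPAM structure:

```go
// Toy sketch of a gossiped address ring. Each entry says "the range starting
// at this token is owned by this peer", with a version bumped by the owner on
// every change. Merging two views is deterministic: higher version wins.
// Because the rule is commutative, associative, and idempotent, gossip in any
// order converges -- the CRDT property described above.
package main

import "fmt"

type Entry struct {
	Token   uint32 // start of the range, as an offset into the IP space
	Peer    string // which peer owns [Token, nextToken)
	Version uint64 // bumped by the owner on every change
}

type Ring map[uint32]Entry // keyed by Token

// Merge folds another peer's view into ours.
func (r Ring) Merge(other Ring) {
	for tok, e := range other {
		if cur, ok := r[tok]; !ok || e.Version > cur.Version {
			r[tok] = e
		}
	}
}

func main() {
	a := Ring{0: {0, "peer-a", 1}, 1 << 31: {1 << 31, "peer-b", 1}}
	b := Ring{0: {0, "peer-a", 1}, 1 << 31: {1 << 31, "peer-c", 2}} // b heard newer news
	a.Merge(b)
	fmt.Println(a[1<<31].Peer) // "peer-c": the two views have reconciled
}
```

A peer that was on a plane for six hours just merges everyone else's newer entries when it comes back; no central store has to be consulted.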
Okay. I wanted to talk about the community a little bit. I have a chart here. Where it says "installs", that's the count we get from Docker of Docker pull operations, and it's running at well over a million a week. It was up at two million — this is the last year — two million a week, down to about one and a half million a week. So we see the software fired up a lot. As an open source project, we don't have a very good idea of who's using it. People write in when they have a problem, sometimes, but they generally don't write in just to tell us that they're using it and they're happy with it. So this is one of the few bits of evidence we have: the thing gets fired up, in some sense, a million or two times a week.

Compared to that, we get very few PRs. We get lots of people coming along and saying things like — this is just one I picked because it came up recently — over a period of a year and a half, people complaining about a setting and asking "why don't you change the default?" It's one line. Send a PR. People don't know how, maybe.

So most of the work has been done by people being paid by Weaveworks. This is the GitHub contributors list. Fun statistic: after being the lead on this project for five years, I'm only the second-highest contributor. Matthias Radestock — who is also ex-RabbitMQ, and a co-founder of Weaveworks — is still the number one contributor. All these people work for Weaveworks; Mike Bryant is the biggest contributor who doesn't work for Weaveworks. And we do have a kind of long tail of people who managed to come up with one or two PRs, which is great, and I would like to encourage that. But it is a little bit dispiriting when people just want to complain about the software and demand that it does something else.

Okay, Kubernetes — this is what you were promised, right? This is the theme of this day. Weave Net is quite popular with Kubernetes, so I thought I'd run through what that means exactly: what is it doing there, and how does it work?
So, Kubernetes doesn't just talk about containers; it talks about pods. A pod is a collection of containers on the same machine. So in the Kubernetes world, conceptually, the blue blobs are pods, but the same stuff is going on: they're talking to each other. And Kubernetes has a very small set of rules, one of which is that any pod can talk to any other pod without going through NAT. Funnily enough, the networking model of Kubernetes matches very well to Google's network — I don't know if we'll ever figure out how that happened. If you run it on GKE, Google's commercial Kubernetes, then instead of this thing with the bridges they just have routes — IP, layer 3 routes — from machine to machine. They have the same bridges, but they don't have anything other than the Google network to transmit packets between machines: they just use Linux routing and let the underlying network deliver the packets to the bridge on the other side. And that just works — if you're Google. It pretty much doesn't just work anywhere else. So there is a need for something to take that place, and Weave Net is one of the things that people sometimes choose for it.

Back around the time this was getting popular, which is about four years ago now, the project rkt ("Rocket"), which came out of CoreOS — kind of a competitor to Docker — had this very simple model for network interfaces: they would exec a process that would add a network interface. And that became CNI, essentially. Some people, including Weaveworks people, got in a room and said "yeah, that should work", and it got named and it got turned into a project. I am a maintainer of the CNI project. But CNI is supposed to be really, really thin. I thought I'd walk through what exactly it is.

CNI is not coupled to Kubernetes — like I said, it came from rkt. It's completely independent of the network and of what we call a runtime. Kubernetes is in the place of the thing we call the runtime in CNI-speak, and physically it's the bit of Kubernetes called the kubelet, which is the bit that runs on each node. The kubelet calls a CNI plugin. The way it calls it, right now, is that it execs a process — in the host namespace, not in a container — and supplies a JSON config which lists a few things, like maybe which subnet you're supposed to be using. Then, conceptually: somebody showed up with a network. You bought one from Juniper, or you installed Weave Net, or you're using Cilium, or whatever — somebody's got a network. The job of the plugin is just to be that little bit of glue in between: to interpret this JSON spec and to cause the network to attach itself to a container.

That's the idea of the CNI project, and I think it's worked fairly well in its goal of being agnostic and staying out of people's way. I do quite often hear complaints that CNI doesn't do this and CNI doesn't do that, and the unfortunate news is that it's never going to do those things, because it's trying to be the thinnest possible layer that could work for everybody. It's JSON: if you want to say extra things in the JSON, just add fields — party on. Okay, that's CNI.
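To show how thin that contract is, here's a toy CNI plugin sketched in Go — just the shape of the exec interface, not our real plugin, and the result JSON is simplified. The runtime execs the binary with the JSON config on stdin and a few CNI_* environment variables, and reads a JSON result from stdout:

```go
// Toy CNI plugin: shows the shape of the exec contract, nothing more.
// A real plugin would create a veth, attach it to the network, and report
// the actual allocated IP back in the result.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
	// Extra fields are the plugin's business -- "just add fields".
}

func main() {
	// The JSON network config arrives on stdin.
	var conf netConf
	if err := json.NewDecoder(os.Stdin).Decode(&conf); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The runtime says what to do, and to whom, via environment variables.
	switch os.Getenv("CNI_COMMAND") {
	case "ADD":
		netns := os.Getenv("CNI_NETNS")   // e.g. a /proc/<pid>/ns/net path
		ifname := os.Getenv("CNI_IFNAME") // e.g. eth0
		_, _ = netns, ifname
		// ... glue the network to the container here ...
		// Reply with a (simplified) result on stdout.
		fmt.Printf(`{"cniVersion":%q,"ips":[{"address":"10.32.0.5/12"}]}`, conf.CNIVersion)
	case "DEL":
		// ... detach and clean up ...
	}
}
```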
So, how do we get Weave Net installed? I just mentioned that the plugin runs in the host, as a process on the host — and everything we're talking about is containers, which are isolated. We get round that by devious trickery: we mount a directory off the host, and when Weave Net starts up, it copies the file into the host directory. So now it's on the host. As far as I know, I invented this trick — but everyone does it now, so maybe I copied it from someone else. Tell me at the end if it was your idea.

Kubernetes has this concept of a DaemonSet, which basically means: run the same thing on every node, and restart it if it dies, that kind of thing. So that's how we fire up: we arrange for someone to ask for that DaemonSet, which fires up a copy of our software on every node; we do this trick with copying a file onto the host; and now we're away. Kubernetes is going to call the plugin, the plugin is going to call back up into the daemon, and that's how that all works.
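Here's a minimal sketch of that copy-onto-the-host trick, assuming the host's /opt/cni/bin — where the kubelet looks for CNI plugins — is mounted into our container at /host/opt/cni/bin. The paths and the binary name are illustrative assumptions, not necessarily what the actual manifests use:

```go
// Sketch: install our CNI plugin binary onto the host by copying it into a
// host-mounted directory. Paths are illustrative assumptions.
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	src, err := os.Open("/usr/bin/weave-plugin") // shipped inside our image
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	// /host/opt/cni/bin is a hostPath volume: writing here writes to the
	// host's /opt/cni/bin directory.
	dst, err := os.OpenFile("/host/opt/cni/bin/weave-net.tmp",
		os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		log.Fatal(err)
	}
	dst.Close()

	// Rename atomically so the kubelet never sees a half-written binary.
	if err := os.Rename("/host/opt/cni/bin/weave-net.tmp",
		"/host/opt/cni/bin/weave-net"); err != nil {
		log.Fatal(err)
	}
}
```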
I remarked earlier about not having any kind of central, consistent idea of what's going on — and of course in Kubernetes you have exactly that: the central API server does know everything that's going on in a Kubernetes cluster. So a few times we've thought about abandoning the eventually-consistent stuff and just relying on what Kubernetes is telling us, which is what everyone else does. We never quite got around to doing it. Anyway, it's an idea — if you want to submit a PR, that'd be great.

We do implement Kubernetes network policy, which was mentioned in a couple of the previous talks — that is, saying who's allowed to talk to whom. We do that by relying on what Kubernetes tells us, because it's the only thing that knows all the labels on the different things. And, somewhat excitingly, the network is implemented at layer 2 and the network policy is implemented at layer 3, and they have essentially no connection between them: we just run them as two separate processes in the same pod.

Anyhow — I'll just skip over that. Yeah, that's pretty much what I wanted to say. Does anyone have any questions? Yes?

Okay, so — well, that's not a question, that's an observation, but the observation was... okay: do we have any plans to support IPv6? Weave Net has no support for IPv6, in two ways: it doesn't support IPv6 inside the overlay, and it doesn't support IPv6 as a target on the underlying network. So which of those two did you want? You wanted both of them. And may I ask a question back? The whole point of overlay networks, generally, is that you have some problem that stops you just routing from one container to another, and that problem is very often an addressing problem in IPv4. So do you know what problem in IPv6 you're solving? Like, why can't you just route between the two containers?

Okay, so your point was that you need some pods... Now, my suggestion is that all pods can have globally reachable IPv6 addresses. So you don't need anything else — you don't need me to write any code, because IPv6 will solve your problem. All right, I think we have to take that offline. I mean, the bottom line is: why doesn't it support IPv6? Because nobody did the work. It's an open source project. Weaveworks, as a company, found something much more exciting to do, which is called GitOps — and you should all buy that. We never managed to monetize Weave Net: we never made an enterprise version, and we never found anyone who was, for instance, willing to pay us enough money to do an IPv6 implementation. Thank you for the question. Any more?

One over here — the question is: what was the conntrack race condition? So I should put Martynas's smiling face up. Oh, I'm pressing the wrong button to put Martynas up. In particular, that link at the bottom is not the right one to look at — I should have changed that, sorry. Okay, I'm just trying to see if I can give you a short explanation of this. Basically, it shows up when doing DNS requests, particularly on Kubernetes, particularly using the musl C library. What happens is that it makes two requests at exactly the same time, for the A record and the AAAA record — notwithstanding the fact that we don't support IPv6. The two DNS requests go out from the same source address to the same destination address, same source port, same destination port, and they hit a race condition in conntrack, and one of them gets dropped. And no — it's fixed in the Linux kernel now. Like I say, we spent most of our time not on our own software; we spent most of our time fighting other people's software, including Linux, and in some cases fixing it. Martynas found three race conditions, and he wrote two patches to fix two of them. You can Google "why do I see a mysterious five-second delay in my Kubernetes system" — this is not the only reason why people see mysterious five-second delays, but it's certainly a very popular one. The nature of the requests is that they're from the same source address to the same destination address, same source port, same destination port, and conntrack does not know how to deal with that.

Well, I'm out of time. Thank you.