Thanks. Can everyone hear me all right? Okay. So, who saw my CoreOS introduction earlier this week? Okay, great. That's really good, because I will not be spending a lot of time talking about what we're actually doing in the tutorial. It will be mostly hands-on on the command line. I will be giving some introduction to the components as we use them within the tutorial, but it's not going to be in-depth. There will be a video of my CoreOS introduction from earlier this week uploaded to the conference website, so you can follow along with the tutorial. You may be slightly confused, but the introduction will clear up any lingering questions.

So, I'm the CTO and co-founder of CoreOS. Originally, this talk was submitted by Kelsey Hightower. So, if you have any praise, it goes to @brandonphilips, and if you have any complaints or critiques of this talk, send those to @kelseyhightower. That'll be where you'll want to send any of those things. And then, yeah, that's my GitHub profile photo, if you have ever seen me on the internet before. I know I appreciate it when people use their GitHub profile photos on their presentations, because I can't often recognize faces, but I've seen those little Gravatars so many times that it's instant recognition.

So, right. I'm going to quickly give people some context on what we're doing here. With CoreOS, the goal is to build a data center as a computer. Essentially, we want to have a number of virtual machines or physical hosts acting together to run applications, and we want to design for resiliency against individual hosts failing. So, I'll exit the presentation and then we can start dropping down to the command line stuff.

This tutorial is laid out in a readme file on github.com/philips/coreos-ops-tutorial. Can everyone see the font size on the shell? Okay. There are a few prerequisite tools that you'll need to build if you don't have them on your laptop, and you'll need a working Go environment. Since we're really short on time and this lecture hall is gigantic, I'm not going to really wait for anybody. So, sorry. What I recommend is that you use this readme file as your Cliff's Notes if you want to try this at home, but mostly follow along with me, because this tutorial is rather short. Yes? It's just the file labeled readme, yeah. Yeah. There are binary releases of these things, but I couldn't anticipate what Linuxes or OS X versions or Windows people were running, and it just complicated everything.

So, where we'll start is we'll start up a single virtual machine that will be our control machine for this cluster. And I'm using Google Compute Engine for all the VMs that we're spinning up. What I'm doing here with this command line tool — Google has an SDK with a command line tool called gcloud — is specifying that I'd like to turn on a CoreOS instance running a fairly recent version of CoreOS, and then I'm sending in some user data to do initial configuration of that host. The user data is a YAML file; what it is, classically, is called a cloud-config. This has been adopted by a number of different operating system vendors, and it's essentially a way of doing initial configuration of a virtual machine as it does its first boot. So, why don't we take a look at that cloud-config and how we're configuring this control node.
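Before looking at the file, the instance-creation command just described looks roughly like the following; the instance name, zone, and image values here are placeholders, not necessarily what the tutorial repo uses.

```sh
# Hedged sketch: boot a single CoreOS control node on Google Compute Engine and
# pass a cloud-config file in as user data. Name, zone, and image are placeholders.
gcloud compute instances create control1 \
  --zone us-central1-a \
  --image-project coreos-cloud --image-family coreos-stable \
  --metadata-from-file user-data=control-node-cloud-config.yaml
```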
Can everyone still read the font size? Because it's really important that you actually be able to see the text. Okay. So, this is a YAML file. And let me see if I can turn this back on. All right. This YAML file does a few things. First, it turns on fleet and sends in a little bit of metadata saying that the role of this virtual machine is a control server, meaning that this virtual machine will be running an API service that we talk to throughout this tutorial. So, this will be the primary IP address into the cluster, because you have to talk to some API at some IP address. We'll use this one VM as the canonical thing. Obviously, all the APIs we're talking to can be leader elected and can fail over automatically and that sort of thing; it's just for convenience that we're talking to this one server.

We're also telling this host that it needs to be running etcd. Etcd is a key value store that is designed to tolerate individual host failures, and this is the backbone for the various schedulers and tools that we'll be using. So, the schedulers of Kubernetes and fleet use etcd, and our overlay networking system, called Flannel, will also be using etcd. So, this control node will be hosting the etcd cluster. We're only using a single-member etcd cluster in this tutorial, but etcd is designed to be run on five to seven machines for host resiliency. Again, this is a tutorial, so we're doing the simplest possible thing. And so, we start etcd, we start fleet, and then we also start the systemd journal gateway service. Later in the demonstration, we'll be exporting all the journal entries off of the host onto a hosted service. All right. So, that's the initial configuration. Essentially, we're bringing up a single machine that's running etcd and fleet and has an API endpoint for exporting log files. Everyone on board? Yes. Okay.

So, it looks like Google Cloud is having a good day, so everything came up really fast. I got rate limited earlier. Don't spin up lots of machines; you will get rate limited. So, what we'll do first, for convenience, is use a few command line tools from our local laptop. The first will be etcdctl — or "etcd cuddle". It's a command line tool that gives you easy command line access to the etcd key value store, so you can set keys and get keys, etc. So, we'll set up — we need the actual IP there — a simple SSH proxy to this control host. And then we'll just blindly trust the internet to give us public keys without doing any verification. Don't do this at home. Yes, I always trust the internet. Right. So, what this did was forward the etcd port, 4001, from the etcd that's running on the control host to our laptop. We actually got an IANA-assigned port recently; 4001 was just chosen at random when we started the project. So, now we are able to do things like etcdctl set foo bar. Okay. So, we're able to talk to the etcd server that's running on this control host, and obviously we can get keys back out. So, we're able to do things with this etcd cluster.

All right. The other piece is that we'll want to have access to the Kubernetes API later in the tutorial, so we'll go ahead and set that up too. Oops. I'm bad at bash. So, this is another SSH proxy that's going to forward port 8080 to our localhost.
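Pulled together, that laptop-side convenience setup looks roughly like this; the control node's external IP is a placeholder.

```sh
# Forward the control node's APIs to the laptop over SSH (don't blindly trust host keys at home).
ssh -f -N -L 4001:127.0.0.1:4001 core@<control-external-ip>   # etcd
etcdctl set foo bar                                            # talks to the tunneled etcd
etcdctl get foo
ssh -f -N -L 8080:127.0.0.1:8080 core@<control-external-ip>   # Kubernetes API, used later
```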
And then the last bit is that we will want access to fleet. Fleet is a very simple scheduler that's designed around creating something that looks like systemd, but is distributed across a lot of hosts. So, systemd is an init system for a single host; fleet is an init system for multiple hosts. You interact with it in a very similar way to systemd. So, we will say that this control host will be the fleetctl tunnel. Now we can do things like fleetctl list-units. And again, just blindly trust the internet. All right. So, there's nothing running on the cluster right now, because it's a brand new cluster. Great.

And then the last bit is that, since I didn't want to bother setting up DNS for the tutorial, there are a few configuration files and configuration parameters that are templated within this Git repository that we're working out of. So, I'm just going to run a quick sed to configure all that stuff. Essentially, it's just so that the worker nodes — so there's a control node, which is hosting all the APIs and stuff, and then there are going to be an N number of worker nodes — so that those worker nodes know where to get information about the cluster. So, we will do a quick sed. And then, uh-oh. What did I do wrong? Ah, the slash g. Okay. So, now if we do git diff, we should see that a bunch of stuff has been changed: essentially where to find the etcd cluster, where to talk to the APIs, etc., in these service files. Great. Everyone on board so far? So, we have a single host and we're able to talk to it via a couple of APIs.

All right. So, the next piece that we want to accomplish here is that we want to have networking between the hosts. So, we wrote a piece of software called Flannel. And what Flannel does — it's an overlay network fabric. Get the joke? It's Flannel. All right. So, what Flannel does is it creates a UDP-encapsulated overlay network that you can run fairly inexpensively on VMs. The problem that we're solving is that a lot of people have infrastructure where each host has a single IP. But our opinion, and the opinion of projects like Kubernetes, is that it's really convenient if each container — each application that's running in your infrastructure — has its own unique IP, instead of trying to do the port mapping business. And so, what Flannel does is it essentially creates a logical route table between a virtual machine that has IP A and another virtual machine that has IP B, and then it overlays a 10.x, or whatever sort of subnet you want, on top of those two things. And so, each host gets, say, a /24 or a /16, however you want to allocate networks. In this case, what I've done here is I'm storing the configuration of the Flannel network, saying that I'm going to use 10.244.0.0 as a /16, and Flannel will assign subnets out of that IP range to each individual host. And then, as traffic comes out of a host, it'll route packets for other 10.244 addresses to the appropriate virtual machine. Everyone get how that works? Would a diagram help? I'm seeing some nods, so I think everyone gets it. So, we set that configuration because the actual worker nodes are going to want to bring up Flannel and essentially do this subnet assignment to themselves.
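In command form, the fleet tunnel and that shared Flannel configuration amount to something like this; /coreos.com/network/config is flannel's default key, so check the repo for the exact path.

```sh
export FLEETCTL_TUNNEL=<control-external-ip>   # fleetctl rides over SSH to the control node
fleetctl list-units                            # empty on a brand new cluster

# Shared overlay config that flannel on each worker will read out of etcd.
etcdctl set /coreos.com/network/config '{"Network": "10.244.0.0/16"}'
```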
So, what I'm going to do here is spin up five worker nodes that are going to connect to our control node. I'll just start it, because Google Cloud can take a few seconds. These worker nodes have a setup similar to the control node: they have a cloud-config that describes the initial state. Essentially, they're running etcd, they're running fleet, just like the other members. So, they'll have an etcd server and a fleet agent running on the host. This part is a little ugly. We cleaned it up recently, and it actually uses containers now, but that was like a week ago and I forgot to update the tutorial. But essentially, Flannel is a daemon that runs in a container that does this IP assignment and then tells the Docker engine: you need to use this network interface called flannel0, and you need to assign IPs to the containers out of this IP range. And then we have to configure Docker. Again, this is stuff that you don't have to do anymore, as of a couple of weeks ago, within CoreOS. But this is the muck to set up Docker appropriately. So, it's bringing up etcd, Docker, and Flannel on each of these worker nodes, hopefully.

Right. So, the five worker nodes came up on Google Compute. They have external and internal IPs. Great. Now, since they brought up fleet and talked to the control node, when we do fleetctl list-machines, we should see five members within that cluster. Perfect — well, actually six, because the control node added itself also. And you'll notice that the nodes have particular roles. So, there's role=node, meaning that they're just regular worker machines, and then there's role=control, which means that that's our control server. And you can use these roles — essentially, it's just an arbitrary set of key value labels — you can use this metadata to say where work lands. Obviously, you don't want your super heavy Hadoop workload or whatever to land on the control node, because the control node has other important things to be doing. But this is how you can start to plan out the roles of different things in the infrastructure and land work in the right place. Yeah. Yeah. Well, so, the label keys have to be unique. So, you can have role=control and then environment=production or whatever, but it has to be a unique set of keys. So, if you wanted to have multiple roles, you'd want role=control,node or something, and your application needs to be aware of that sort of stuff and how it's using that API.

All right. Great. So, why don't we take a look at what actually happened in etcd with all this? You remember that we used etcdctl to write into the etcd key space the shared configuration of how the network should be laid out for Flannel. What is also stored in etcd is, essentially, each of these hosts does a master election on some subset of the IP range that we've allocated. So, you'll notice that hosts have done master elections on these /24s — .39, .22, etc. And so, what's happening is that the Flannel service running on each of these hosts periodically says: hey, I'm a host, I'm still alive, don't give this lease away. It works very much like a DHCP lease. Only, since it's deciding where IP traffic is routed, it needs to be fully consistent, because you want to make sure that the packets you're routing to the hosts are actually going to the hosts that you expect. So, etcdctl get, and there should be some metadata in there. I don't know why it's frozen. Oh, there's no sync. This --no-sync flag is because of the fact we're using an SSH proxy. But, essentially, the metadata that's stored in that key is the public IP address of the host that is hosting that subnet.
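Inspecting those leases looks roughly like this; the key layout is flannel's default and the addresses shown are illustrative.

```sh
# List the per-host subnet leases flannel has acquired, then look at one lease's metadata.
etcdctl --no-sync ls /coreos.com/network/subnets
#   /coreos.com/network/subnets/10.244.39.0-24
#   /coreos.com/network/subnets/10.244.22.0-24
#   ...
etcdctl --no-sync get /coreos.com/network/subnets/10.244.39.0-24
#   {"PublicIP":"10.240.57.206"}   <- the host holding that /24, DHCP-lease style
```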
So, we know where to route those packets to and decapsulate them — we encapsulate them as they're exiting one host and decapsulate them on the other side. All right. So, why don't we look at how this actually works in practice? We'll SSH into one of the nodes, and then we'll SSH into another node too. It seems to be taking a while. And we'll run busybox on this host. I know for a fact that that is the incorrect command line. Oh, it actually worked. Hooray. So, what we're running on each of these nodes — actually, this is a really bad layout, let me change this. What we're running on this machine here, if you can see the command line, is busybox, and we're saying we want to use netcat. Let me run this other one. So, we're running busybox, we're getting the IP address — you'll notice the IP address exists in that 10.244 range of the overlay network, and it is in the .39 /24 subnet of that overlay network. Makes sense. And then we're running netcat listening on port 80 from this container. And then, just to prove that all this encapsulation craziness and master election of IP addresses is working, we can run netcat on the other side as a client. And then, hopefully, if I say hello Auckland... Woo-hoo! Yes! Round of applause now. So, yeah, the overlay network is working. We're running two containers that each have a unique IP address in the environment and are able to talk to each other over TCP. Awesome.

And so, this is the basics of how the network can work. Obviously, you don't have to use Flannel. Flannel is an option for people who have VMs, or have infrastructure that doesn't have really flat networking where you can assign multiple IP addresses to a single host. If you have SDN environments or whatever, you just have to make Docker aware of that, and it's fairly straightforward. If you have things like — say you're using Google Compute — they actually allow you to assign multiple IPs to individual hosts; like, thousands of IPs can land on a single VM. So, if your network supports these sorts of things, you don't need something like Flannel. But in a lot of cases, people don't have these environments, so you need something like this to do the overlay. Alright, everyone on board? Any questions so far? You've been really quiet, everybody. No coughing or anything, so fantastic work. Thank you, thank you.
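The two-node check from a moment ago amounts to something like this; the overlay addresses are illustrative.

```sh
# On worker node A: run a busybox container, show its overlay IP, listen on port 80.
docker run -i -t busybox /bin/sh
/ # ip addr show eth0        # expect something in 10.244.39.0/24
/ # nc -l -p 80

# On worker node B: another busybox container, connect to A's overlay IP and send a line.
docker run -i -t busybox /bin/sh
/ # nc 10.244.39.2 80
hello Auckland
```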
Alright, so this gets you set up in an environment where you have containers, you have a number of hosts, and we haven't really done anything that interesting quite yet. One of the questions that we do get asked quite a bit is that people would like to be doing things like log aggregation or service monitoring and that sort of thing. So, I did want to go through and show how that sort of stuff can work — let me see, hopefully I didn't lose my place, there we go. So, there's a hosted service called Logentries, and Logentries is one of these services that can take your logs over TCP and export them off of a host. They have an API-key-based thing, and there's a container that takes the Logentries agent, hooks it up to the systemd journal, and exports your logs at runtime over the internet. So, again, we use etcd to store this important configuration data, the API key. You can imagine wanting to encrypt this first. What other systems do — like vulcand, an HTTP load balancer backed by etcd — is they have a pre-shared key. So, each of the hosts has a key that's laid down on disk — or the system administrator logs in when the process comes up and types in the password — and that pre-shared key is then able to decrypt the TLS private certificates that are held in etcd. But that's left to the application, to figure out how to protect their secrets, if they're interested. Okay, so, right, we set this token and then we can start a global service. So, one of the things that fleet does is it enables you to run a service across all your hosts. So, we'll take a look at this service file.

Who here has used systemd? Like, and written service files? Okay, so, if you're a system administrator or a developer, the good news is that everything we're about to say is extremely relevant, because everyone will be using systemd soon enough. Welcome to the future. So, it's actually not that scary and bad. It turns out the internet has this tendency to overblow arguments into hyperbole. What systemd does is very similar to how things like SysV init or upstart have worked. So, in the past we've had things like /etc/init.d/apache2 start, and then upstart came along and it became start apache2, and then systemd came along and, like, completely revolutionized the world, and now it's systemctl start apache2. And this is what a lot of people are super upset about. And then, along with that, one of the big problems with things like /etc/init.d/apache2 is that it's this 400-line bash script that does a bunch of random stuff, and it implements this logic that every daemon has to implement: saving a PID file somewhere, and then checking configuration files, and then checking that PID file and doing the right thing if somebody does a reload — just all this boilerplate stuff. And so, what systemd has done is it's a static file that doesn't allow for any sort of Turing completeness at all. So — I'll just delete some of this — you can say things like ExecStartPre, so things that should happen before, and then you can say other things like: I want you to run this single binary. It doesn't allow for any sort of Turing complete stuff. So a service file can be as simple as — we'll write one really quick — it can be as simple as this. We'll do something really useful, like /usr/bin/sleep 5000, or 4000. And so, this is a fully compatible service file. This is all you need to start a service. And this can be, you know, a Python script, this can be bash, whatever. And systemd does a number of nice things, like it'll automatically encapsulate it in a cgroup. When you run systemctl kill, it doesn't rely on any PID files or anything; since it's in a cgroup, it actually nukes the entire cgroup. So even if your application double-forks or quadruple-forks lots and lots of processes, and they essentially are no longer parented correctly, it's fine. It will clean up everything that was ever forked out of that cgroup.

Right. And so, what we're going to do here is: this is a service file that would run under a regular systemd system, but we also have this additional section called X-Fleet — it has been simplified since, no, it is still X-Fleet — which is a service file extension, in particular for fleet. And it has just one entry saying Global. So, since this is our log-exporting daemon, we want to export all the logs from all the hosts to this hosted service. So, we'll do fleetctl start and then this thing.
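A hedged sketch of what a global log-export unit like that can look like — the container image and the way the token gets passed in are illustrative placeholders, not necessarily the exact unit in the tutorial repo.

```ini
[Unit]
Description=Ship the systemd journal to Logentries

[Service]
# Image name and token handling are placeholders for whatever the repo actually uses.
ExecStartPre=-/usr/bin/docker pull quay.io/kelseyhightower/journal-2-logentries
ExecStart=/usr/bin/docker run --rm --name journal-2-logentries \
          -e LOGENTRIES_TOKEN=<token-read-from-etcd> \
          quay.io/kelseyhightower/journal-2-logentries
Restart=always

[X-Fleet]
Global=true
```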
And what should happen is that this will land, via fleet's API, inside of etcd. Each of the fleet agents running on the hosts — remember, we told each of the hosts to be running fleet.service — each of those fleet agents will notice that there's been a change in their configuration, and then they will start running the service. So, if we do fleetctl list-units, we should see that the thing has started. You'll see that each of those services is active and running on those individual hosts. And then, if the internet is still working — yes?

So, yes, this is a good question. What's happening is that the fleet agent is looking into etcd, essentially, and it's checking for any work that's been assigned to this individual machine. So, node one is checking the node one work list and saying: oh, is there stuff in that list that I'm not running? Okay, I will tell systemd: you need to run this service file. So, it downloads the service file out of etcd and then tells systemd on the individual host: run this. And then it reports the status of that process — like, whether it's running and active — back up into etcd, so that the cluster sees: yeah, I've requested five of these things, and those five things are in fact running. And that's where this metadata is coming from. And then you can do all sorts of really adorable things like fleetctl ssh. This is one of the reasons Go is awesome, because they implemented a full SSH client in Go. Based on the metadata stored in the fleet API, I can do things like fleetctl ssh and then one of the machine IDs from fleet, or, if individual services are running, I can do fleetctl ssh and then the name of a service file. Let me make sure I didn't lose my terminal. Okay. Yeah.

[Audience question, partly inaudible, about what fleet actually stores in the etcd key space.] I don't think I follow. So, the use case is that you store anything that you need to have as configuration in etcd, and storing ASCII text is a perfectly good configuration format. So, I don't think I follow your question. Yeah. Yes. Yeah, that's what we do. So, the question was: how do we store these service files in etcd? And the answer is that we have a list of all the service files that have been uploaded by users, and those exist as, essentially, an etcd directory. So, they're prefixed — it's like coreos.com/fleet/units or something — and each of those service files exists as its own key. And then we point to those and tell machines: you need to be running the service file named X.

Yes. Hi. So, fleet and the underlying workers are both talking essentially indirectly via etcd, is this correct? Yes. That's how fleet is implemented. So, the fleet agents need to have access to the etcd server. Right. So, essentially, by setting and updating values in etcd, they're communicating with each other. Right. Well, so, there's no horizontal communication within fleet. Essentially, all that happens is we leader-elect a single process called the fleet scheduler, and then every machine runs the fleet agent. The agents are just dumb, and they're doing this loop where they look and say: what should I be running? Am I running that? No. Start the things that I'm not running.
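In terms of commands, that interaction pulls together as roughly this; the machine ID and unit name are illustrative.

```sh
fleetctl start journal-2-logentries.service   # submit the global unit to the cluster
fleetctl list-units                           # shortly after: active/running on every host
fleetctl ssh 7207add7                         # SSH to a host by its fleet machine ID, or...
fleetctl ssh some-unit.service                # ...by the name of a unit running somewhere
```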
And the fleet scheduler accepts API requests. And then, based on what the user has requested — so the user requested, in this case, five of these journal-offloader thingies to run — based on those requests, it farms that work out into individual machines' queues. And then those machines are just, again, looping, waiting for work to land on them. And then, if the scheduler notices, hey, somebody requested five of these things, but only four are running, it'll choose a new host and assign the work to that host. Right. So, these agents are using systemd to spin up our cgroups or Docker containers, however you have it configured? Correct. Yeah. We thought this was a really convenient abstraction, this idea of using systemd services, but then giving you essentially leader election and replication of services for free, without having to do any more work than take that exact same service file that runs on one host and run it on lots of hosts. Yes. Cool.

So, in the example you've just run — the fleetctl, journal, blah, blah, blah, service — that started on all of your hosts. Did that start on all of them because of the Global=true? Yes. Okay. So, I could have said, you know, some other thing and it would have started on five of them, or... Right. Exactly. So, the pattern that we use if you want to start, like, five of something is we do expansion. So, we'll do something like this — we'll talk about how other systems do it; the other examples in this talk will be around Kubernetes, and Kubernetes has a slightly different way of doing replication of a single service. But this is the systemd pattern of doing a templated unit. So, you say whatever the name of your unit is, and then an at sign, and then some identifier. Yeah. All right. Everyone on board.

So, hopefully, if logentries.com is working right now, we should be able to see some log data streaming in. Woo-hoo. So, we have real-time log data coming off of these hosts. And oh, it looks like people are trying to hack these servers. Perfect. Getting invalid SSH logins for user xbn. Awesome. Perfect. So, our machines are actively getting hack attempts, and we're able to see that in some sort of centralized dashboard. And we used a single service file and fleet to farm this out to our entire infrastructure. So, it was this file that did it — and, in particular, this container that is running the agent for the Logentries hosted service, talking to the systemd journal socket to actually get the logs off the host. Great. Right. So, those are running throughout the infrastructure.
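As an aside, the templated-unit pattern mentioned a moment ago looks something like this; hello@.service is a hypothetical name, not a unit from the repo.

```ini
# hello@.service — hypothetical fleet template unit; %i expands to whatever follows the @.
# Five copies could be started with: fleetctl start hello@{1..5}.service
[Service]
ExecStart=/usr/bin/docker run --rm --name hello-%i quay.io/kelseyhightower/hello-world
ExecStop=/usr/bin/docker stop hello-%i
```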
Now, another common question that we get about CoreOS is: where are all of my system administration tools? One of the things that CoreOS does not do is ship a package manager. We believe that containers provide all of the utility of package management, but free the individual operating system vendor from having to ship and snapshot all of open source at a given time. And there's some nice fallout from this. Since we're freed from having to snapshot and ship all of open source, we're able to concentrate on the pieces of code that we think are important and that are required to get your application running. This is things like your container runtime system, SSH, and the kernel, most importantly. And the primary things inside of CoreOS are these three components: SSH, the kernel, and a container runtime system. All right.

But we do have this cute tool called Toolbox that we wrote. What Toolbox does is it downloads a base image of whatever operating system you'd like to have — by default, it downloads the base image of Fedora. So it downloads that base image and extracts it to disk, and then, instead of using the Docker runtime — which doesn't have a very granular way of deciding which namespaces I want to have access to — it uses systemd-nspawn in the background, so that we are running in a container that has access to most of the host resources, like the process ID namespace, the network namespace, and a few other pieces. And so what happens is that we've now been dropped inside of a Fedora environment on top of this CoreOS host. One of the things that you often want to run is, like, tcpdump on a host. Obviously, we don't ship that, because we don't have a package management system. But you can imagine that — and one of the other things that Toolbox does is it enables you to create... why is it not downloading? Yes. It enables you to set things up so that, say you have two sysadmins in your environment, Joe and Sally, and Joe really prefers Debian but Sally has worked in a Fedora environment for her entire career, you can use Toolbox so that when they SSH into the CoreOS host, they get whatever their preferred sysadmining environment is. And so they can use whatever tools they're used to using — they can use their Emacs, whatever, etc. — all within, essentially, a sysadmining container. And then you can clean that whole mess up when they leave. It just nukes the entire container's root file system when they log back out of the host, so you're left with a consistent environment again. Essentially, this is solving the problem of every single host you log into being slightly special, because that sysadmin left behind their Emacs configuration that one time, and then they forgot that they installed GCC because they needed some special network debug tool, etc. It kind of solves that problem.

So what we'll do, as just a demonstration — it's called Toolbox. It's really simple, like 40 lines of code, but it integrates systemd-nspawn and Docker. So we download the Docker image and then use it in nspawn. And then, apparently, they removed iproute from the base Fedora image, so I can't run ip. Super minimal. Okay. So what we have is the... this is the flannel bridge. And so we'll start to dump all the traffic that's coming across our flannel overlay network bridge. And then we'll log into another host here, and we'll do a ping so that we can see that the tcpdump is actually running as we would expect it to be running. Of course I lost my terminal. Oh no. Where is it? Right. So we have an ICMP from... I'm not actually pinging the right address, actually. But you get the idea. Essentially, we're able to do administrative things, like a tcpdump, et cetera, from a host that has no tools installed, and verify that networking is working, et cetera.
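A rough sketch of that workflow on a CoreOS host; the package manager command and the cache path reflect what's described here and may vary by Fedora release.

```sh
toolbox                           # pulls the default Fedora image, then drops you in via systemd-nspawn
yum install -y tcpdump iproute    # install the debug tools CoreOS itself doesn't ship
tcpdump -i flannel0               # watch traffic crossing the flannel overlay interface
exit
# The unpacked rootfs stays cached on disk (under /var/lib/toolbox/<user>-<image>, per the
# cleanup discussion below); removing it gives the next login a clean environment again.
```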
Any questions on Toolbox or the utility there? Yeah. You log out and the stuff disappears? Not automatically. You can do that. So, if you're saying it cleans itself up — it doesn't, but you can set that up. Essentially, the question was whether the Toolbox automatically cleans itself up. Essentially, it's just a static file system — I think it's under /var/lib/toolbox. Yeah. So it's the name of the user and then the name of the environment that they want to run in. And so you can imagine having a cron job or something that just nukes this directory every once in a while. Yeah. So it comes up as a container that's running the shell session against the host file systems? Right. And it's mostly just a hot cache. So if Joe or Sally logs back in, their tools are still on that host. But you'd definitely want to set up some type of timer or something to clean that up. Yep. Any other questions about Toolbox, or the utility there, or how you can use sysadmin tools on a minimal operating system, or any other questions on how that works? Great. Okay.

So what we're going to do now is I wanted to show other sorts of schedulers running on top of CoreOS. Who here uses a scheduler-based system, or is familiar with schedulers in general? Okay. That's very few hands. So I'll do a couple of slides, just so everyone's on board and knows what we're talking about when we talk about a scheduler-based system. All right. Back to the presentation. Okay. So we've talked about containers a bit. We talked about CoreOS and the reduced API contracts there and how they do stuff. The next piece is clustering. And this is where the schedulers come in, and this is where we start to talk about designing infrastructure for individual host failure. The first piece of that is etcd, where all that stuff's stored. But the other piece is scheduling. Essentially, scheduling is about getting work to servers in a manner that doesn't involve human beings saying: I have five servers, I have four services, how am I going to map this out? — and not having to drop down to Excel spreadsheets or readme docs to map things out to machines.

So, selfishly, it begins with you, because you're special and the universe revolves around you, and the computers should be doing the work that we want them to be doing. And so you have a request for the computers. Let's say we have an infrastructure of 50 hosts. I have a service that's already been built inside of a container that is able to handle five requests per second, and I know that my load is going to be 500 requests per second, so I need 100 of these processes running in order to handle the load that I expect. So you describe that in a JSON document: hey, make sure this thing is running, and make sure that enough of them are running — 100 of them — and I want it to be this application. The scheduler then has this active loop. The active loop is: what did the user tell me to do? It told me to run 100 of these things. What's the current state of the system? Zero are running. What's my to-do? I need to have 100 of these things running. What's the list of machines that are out there, and what's their capacity? What sort of resources do they have available? Divide these 100 things up across those machines — in the case of 50 machines that are completely unloaded, ideally it's two per host. And then the machines get this manifest of: you need to be running these two things, you need to be running these two things, etc. And this is the basic process of a scheduler. And this is how things work internally at Google for almost all of their services; they have a system called Borg. This is how things work at Twitter; they use an open source project called Apache Mesos. And a lot of these organizations have a lot of different engineering departments managing a lot of different pieces of infrastructure.
One of the large expenditures in their environments is the capital expenditure of buying servers. And so they don't want, every time an engineer comes up with a new crazy idea, to have that engineer go off and spend 10K on new Dell hardware — as much as that is a fun thing to do, and we all enjoy unboxing shiny new hardware. It is not a very effective use of either the developer's time or the capital resources of a company. And so these schedulers have emerged because they free engineers from having to think about capacity in terms of racks and servers, and let them think about capacity in terms of what their application needs — and then just tell the environment: make this happen on my behalf. Everyone get that? Okay.

And so what Kubernetes is, is a scheduler. Fleet, the tool we used earlier, is also a scheduler, but Kubernetes adds a bunch of things around doing automated rollouts and canarying, et cetera. And then service discovery is a big part of Kubernetes — it defines how service discovery should be built. So it's a much higher-level set of tooling than fleet is; where fleet is trying to be systemd for lots of hosts, Kubernetes is providing a lot of primitives for how you might run web-scale infrastructure.

All right. So there are a number of moving parts in Kubernetes. The first is the kubelet. And so we'll start this unit file — uh-oh, maybe on the actual host. Here we go. So we'll start this unit file for the kubelet. The kubelet is just like the fleet agent: it runs on each host — so that's why we're using a global unit again — and it watches for work and executes it. The next piece is the proxy. The Kubernetes proxy exists on each individual host and is used as a mechanism for service discovery. We'll talk about labels and services later, but the basic idea is that within my environment, if I hit a particular port and IP address, no matter what host I hit that IP and port on, I'll be redirected to wherever that service is running. And that's via a reverse proxy called the kube-proxy. The next piece is that I'll be running the API server. This API server is going to be running only on my control node. So, as an example of how fleet services work, we have this role=control, and so it ends up landing on just the control node. And this is the HTTP service that I'll actually be talking to in order to make Kubernetes API requests. There's the controller manager, which deals with replication controllers, which we'll talk about in a bit. There's the scheduler, which is the piece that actually takes work that the user has requested and transforms that into lists of work that the individual kubelets will execute — again, that'll run on the control machine only. And then the last piece is a thing called the kube-register, which takes metadata about the hosts that exist within CoreOS and uploads it to Kubernetes, so that the Kubernetes API has access to all the same metadata that the CoreOS host has. Mostly it's about what the networking topology is like at the moment.

So with those services, we should, with any luck, have a fully working Kubernetes stack. And we should be able to see them within fleet. Looks like everything's active and running. So, fantastic.
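As a sketch of how that role=control pinning works in a fleet unit — the binary path and flags here are placeholders, and the kubelet and proxy units would use Global=true instead.

```ini
# kube-apiserver.service — illustrative; constrained to the control node by fleet metadata.
[Service]
ExecStart=/opt/bin/kube-apiserver \
          --etcd_servers=http://127.0.0.1:4001 \
          --address=0.0.0.0 --port=8080

[X-Fleet]
MachineMetadata=role=control
```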
And we'll be using a command line tool from my laptop called kubecfg. This tool has since been replaced by a new tool called kubectl — "kube cuddle" — but kubecfg is what this tutorial was written against, and kubectl is like a month old and way buggier. So I'll stick with kubecfg right now. And so what's happened is that, very similar to how fleetctl has list-machines, kubecfg has list minions — they used to be called minions; they've since been renamed to nodes, but the old name still sticks around in the tool. So this is listing all the worker nodes within the infrastructure. We have these five worker nodes, and those are nodes one through five, these CoreOS VMs. Everyone on board. So this is our compute infrastructure. And this is all being talked to over an HTTP interface — kubecfg uses an HTTP interface — and we SSH proxied into the control node, and that's how kubecfg is actually working and talking to the infrastructure.

So what we're going to do here is deploy an application. Actually, let's back up for just a second. Who here has built a Docker container before? Okay, it's about half the room, which makes me a little nervous. So I want to give a quick three-minute digression on how to build a Docker container, just so people are comfortable with the concept of an application running in a container. So I have a host over here that I've been using for Rocket development and testing on Ubuntu, but this host has Docker installed and a Go environment. So on this virtual machine I have Go installed, and I have this really simple hello world application that exposes an HTTP endpoint. And so what I'll do is I'll build this Go source code, and I'll build it as a static binary so it has no external dependencies. On any Linux host, this will just run, without libc installed or anything — you could literally just make this an initrd for the kernel. The binary that'll actually come out of this is hello-world, and if you run file over it, it'll say it's an ELF 64-bit Linux binary.

And then we have this Dockerfile. The Dockerfile is the way in which containers are built within Docker. They can do really sophisticated things like building the source code, etc., but for the sake of this tutorial we're going to do a really simple thing. We're going to say — again, Kelsey Hightower made this; I'm Kelsey Hightower — we'll say this is the maintainer of the thing. The FROM line here is the root of the container, and the scratch container is essentially a completely empty file system. The maintainer is Kelsey. We're going to add this hello-world binary into the container, and it's going to be called /hello-world — we're not going to put it anywhere fancy. We're going to tell the system that it's going to be exposing a single TCP service on port 80, and that when you want to start the container, you're going to execute /hello-world. And that's the entire metadata for this container. So I'll run docker build -t quay.io/kelseyhightower/hello-world. Yep. And so what this will do is take this Dockerfile, take the binary that we've already built, and create a container image from it. And I'll run it with sudo, because apparently I didn't set the group permissions correctly. And then, after doing all these steps, it'll tell me what the image ID is. And then I'm able to do sudo docker run quay.io/kelseyhightower/ — what did I call it? — hello-world. Uh-oh. Maybe I didn't call it that. Oh shoot. Okay. Live demos are hard. Okay. So this will start up the process, and then I should be able to curl it, I hope. Oh, this is bad. Oh, I know what I did wrong. Typing is so hard.
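A hypothetical reconstruction of that digression, end to end — the image name matches what was said on stage, but treat the exact flags and Dockerfile contents as illustrative.

```sh
CGO_ENABLED=0 go build -o hello-world .        # static Go binary, no libc required

cat Dockerfile
#   FROM scratch
#   MAINTAINER Kelsey Hightower
#   ADD hello-world /hello-world
#   EXPOSE 80
#   ENTRYPOINT ["/hello-world"]

sudo docker build -t quay.io/kelseyhightower/hello-world .
sudo docker run -d -p 80:80 quay.io/kelseyhightower/hello-world
curl -i http://127.0.0.1/                      # empty body, but the HTTP headers come back
```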
I used to have a linear algebra teacher who said that linear algebra is easy — linear algebra is easy, arithmetic is impossible. See what I'm saying? Yes. Thank you. Impossible. Okay. Perfect. So it's responding. We should be getting, like, HTTP headers back and stuff; it's just an empty body. Okay. So that's how you build a container. You can do a full stack from source to container, or you can have a pre-built binary and import that into a container, and then it essentially runs like a static binary. So, everyone on board with how a container is built? You get it, you're not afraid of it anymore. It's very simple. Okay. So you can imagine that these containers can be built automatically using a CI system. And like we have ours — it's Quay; we pronounce it "kway" because we're not familiar with the word — you can integrate this service with GitHub and it'll automatically create the container for you, etc. But that's beside the point. The idea is that you have some sort of URL that is your application, and you tell other things in your infrastructure: run this URL. Okay. Let me find my terminal again. Okay.

So what we'll do now is we'll create what's called a replication controller within Kubernetes. And the replication controller is this thing that tells the infrastructure: I want X of these things running. So it's a pretty straightforward JSON document. So, yes. And it has a few pieces of information. The first thing that it has — actually, I'll open this in Vim so you have a cursor to look at — the first thing is we say how many of the thing we want running. This is the JSON document that we'll make an API request against the Kubernetes API with. We tell it what things it's looking for — what are the things that it's actually replicating. So it's replicating a thing called hello, it's in the environment production, and it's on the track stable. So in this example, we'll have a stable track and a canary track: we'll have some copies of our code running a new version that get added to the load balancer later in the demo. And then there's a template of what this application looks like, and this primarily describes the container that should be run. So we're going to have the container, we'll name it hello, it'll start from the image that we just built — or a similar image called hello, at version 1.0. And then we can optionally put some constraints on it: how much CPU — so we're going to constrain it to just 100% of the CPU — and how much memory it can use, and then the actual port that it exposes as its service. And then we assign some additional labels to that application so that we can find it later, and we can find any of its other copies later. And that's the primary thing that we're going to be telling the API: I want one of these things running, you can find the code to run it over here, and then it has these metadata properties — it's going to be in track stable and it's going to be in environment prod. And you can use whatever sort of labels you want. You may have, like, owner=joe; you may have billing-group=accounting, etc. These things are just arbitrary, but you kind of have to agree on them inside of your organization. There are some recommendations, obviously, like an environment of prod or dev, but they are freeform labels.

Okay, yes. Why do you have to repeat the port when the Docker image already adds it? Yeah, that's a good question. The reason that you repeat it is because Kubernetes isn't super opinionated on the container runtime, and so right now they do use the Docker stuff, but you can imagine other things.
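Roughly, a replication controller manifest of that shape — written against the old v1beta1 API that kubecfg spoke, so take the exact field names, image tag, and resource numbers as approximate.

```json
{
  "id": "hello-stable",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "desiredState": {
    "replicas": 1,
    "replicaSelector": {"name": "hello", "environment": "production", "track": "stable"},
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "hello",
          "containers": [{
            "name": "hello",
            "image": "quay.io/kelseyhightower/hello:1.0.0",
            "cpu": 1000,
            "memory": 50000000,
            "ports": [{"containerPort": 80}]
          }]
        }
      },
      "labels": {"name": "hello", "environment": "production", "track": "stable"}
    }
  },
  "labels": {"name": "hello", "environment": "production", "track": "stable"}
}
```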
Right, so what we'll do is take this JSON manifest and talk to the API via kubecfg, and we'll say: I want you to create this replication controller — essentially, I want X replicas of this thing running. So we tell the API: do this on our behalf. And this thing is that while-true loop that we were talking about, constantly checking: is one of these things running? No? Make sure it's running. Are three of these things running? Oh God, make sure there's only one running — kill those other two things. It's just doing this in a constant loop. So we can do kubecfg list replicationControllers and see that there's this hello-stable controller, which is the name of our controller; it's for running the hello application on the stable track, and then the environment and tags and labels that it's in. And what should have happened is that, since we requested that one of these things be running, it should have created a pod.

A pod is an abstraction that exists within Kubernetes, and a pod is the logical combination of a number of containers. So imagine that you have a container that is an HTTP server, and alongside that container you also want to export the logs to another service. Those two things are logically one application, and so Kubernetes has this abstraction of a pod — it's like peas in a pod, or like a pod of whales that are together. So you schedule a pod as a single unit. In our case we only have a single container that makes up the pod, but you can imagine having multiple containers. Another example would be: I have nginx, and then I have some FastCGI daemon that I have to bring up behind nginx, and these two things logically belong together and should be scheduled together. Everyone get the concept?

So what should have happened here is that when I do kubecfg list pods, I should have a single copy of this running on one of my machines. So it looks like lucky number 7207add7 was the machine that got assigned the work, and it tells us what host it's running on, what the IP address is — the IP address that was assigned out of Flannel, etc. — and that it's currently running. Perfect. Now things get slightly interesting, because we want to take that same JSON file that we just had and bump this up to say four, so we have multiple copies — so we have resiliency against host failures. So we'll run kubecfg with that same JSON file, but instead of saying create replication controller, we'll say update the existing replication controller, and the kubelet agents should be looking for all that work. So, great. We see that there are now four of these instances running inside the environment. Awesome.
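The kubecfg side of that step, approximately — kubecfg predates kubectl and its argument order differed a bit, so treat this as a sketch.

```sh
kubecfg -c hello-stable-controller.json create replicationControllers
kubecfg list replicationControllers
kubecfg list pods                 # one pod, scheduled onto some machine, with a 10.244.x.x IP

# Edit the JSON so "replicas" is 4 instead of 1, then push the change:
kubecfg -c hello-stable-controller.json update replicationControllers/hello-stable
kubecfg list pods                 # now four pods spread across the workers
```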
Now, this is interesting, but all I've done is spray a bunch of processes over your environment, and you have no idea of how to talk to them — not super useful. The entire point of this is that you load balance something behind example.com and you're making loads of money. And so this is only useful if you are able to create a service. And so what happens in Kubernetes is you have the processes running, and then you take those labels — the name of the service, in our case hello — and you create a load balancer, called a service, that load balances all those processes behind it.

So let's look at the hello service JSON file. This says that the container port that it's looking for is port 80, it's going to expose the service on port 80, and it's looking for things named hello in the production environment. So you can imagine having a service that load balances only the development environment, or that load balances prod and development together because you want to have a mix of things, and exposing them on different ports and different IP addresses so that developers have a way of mixing these things together. Yes? Does that specify the health check used to verify that the machines are up? Yeah, so this is a simple TCP reverse proxy. You can imagine having more sophisticated reverse proxies, but it implicitly does a health check, because it's a reverse proxy: if it tries to proxy you to somebody who's not up, it'll get a connection refused and it'll choose the next one. It's like HAProxy or something — it's a round robin load balancing TCP proxy. But again, you could imagine using the API and exposing an HAProxy service instead, that then load balances the things behind it.

Is it the control node that's doing that — that appears as the load balancer? No, this load balancer will end up running on each individual host, and this is so that you can have access to all the exposed services within the environment. It's getting a little more sophisticated within Kubernetes: they have the idea of a portal, and so each service will get a unique IP assigned out of the IP space, and then have the port attached to it, and that combination of things is called a portal. That essentially is a floating IP in the environment, and you can always talk to it, and the network is in charge of ensuring that you're able to talk to that thing. So this is the most simple implementation of services, but you can imagine more sophisticated things that use virtual IPs and that sort of stuff. So the virtual IP is just going to move from machine to machine? Well, the individual worker machines are in charge of ensuring that the hosts have a route to those virtual IPs. Okay, thank you.
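The hello service file being discussed is roughly this shape, again in the old v1beta1 form that kubecfg spoke, so the field names are approximate.

```json
{
  "id": "hello",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 80,
  "containerPort": 80,
  "selector": {"name": "hello", "environment": "production"}
}
```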
Yes? Yes. With etcd, running fleet, fleetctl, and the fleet engine, you've got a control node and you've got worker nodes, and then on top of that you lay Kubernetes, which is laying down a whole lot of services and properties, and now you're exercising more infrastructure. Yep, yeah. Do you have a diagram of that? Yeah, I don't have one built up; I wish I had somewhere to draw it. But — it's a lot to hold in your head all at once? Yeah, I completely agree. So, right, we can kind of break it down. So we have CoreOS as the underlying Linux operating system — it really does nothing interesting besides SSH and the kernel. Then we have a shared configuration store across the five hosts. Etcd? Etcd. And so etcd is where all these hosts are able to share a key value namespace, so you're able to write simple JSON documents, et cetera, into this namespace, and people are able to watch for changes. On top of that we have two schedulers. We have a very simple scheduler called fleet that looks like systemd and allows for the abstraction of global services — I want to have my log aggregation agent exporting logs off every host, I want this Kubernetes agent to run on all my hosts so they can download work. So it's a very simple scheduler, but it gives you a few primitives that are very useful. And then we're running an application. This application happens to be a scheduler itself, called Kubernetes. And Kubernetes gives you the ability to run complex pieces of infrastructure that expose services, does service discovery, and then gives you the ability to do these automated replications of simple, horizontally scalable services. And that is running on top of the CoreOS host, along with the fleet agent itself. So this is essentially the target application for this CoreOS host. You can imagine other schedulers being run — we have users that run Mesos on top of CoreOS. We're just using Kubernetes as an example of an application that can be run on top of CoreOS that's more sophisticated and gives you something that looks close to the sorts of infrastructure that a lot of us are building, which is a two-tier application where you have an HTTP tier and you have a database tier. This is the sort of application that Kubernetes is designed to run for you, and to enable the system administrator to run efficiently in their infrastructure.

Yes. Yes. So is Flannel the same thing as the Kubernetes kube-proxies, or whatever you call them? No, so the proxies are there because we need a load balancer. We're taking a particular port — in this case port 80 — and we're saying: this port 80 you can access, and then it's going to do a round robin proxy to everyone who's running that service. Flannel is about: I want to have an overlay network so that each of my containers is able to talk to the others across the network without having to take into consideration that they're in a container. They act as if they're just any normal Unix process, and so if one gets an address of 10.244.39.1, it says: oh, I can just talk to that guy, and my packets get routed — I don't have to figure out port mapping or anything like that. And you can imagine that in an infrastructure where you have a software defined network, or you have good networking gear, you could do this without something like Flannel. But in a lot of cases we're running in very constrained network environments, and so you need to overlay a more sophisticated thing that allows a single host to hold on to lots of IPs. Flannel is taking care of the layer 3 overlay network, and then Kubernetes is taking care of the load balancing of lots of things that are running in the environment to a single IP and port combination. So you've got the separation between the infrastructure and the application? Exactly. Exactly. So the proxy is all about the application — you can imagine taking this service and load balancing it, or even putting it into rotation on your DNS endpoint — whereas the overlay network is only for internal communication between the containers. Yeah.
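Since there wasn't a diagram handy, here is a rough text sketch of the layering that was just broken down:

```
laptop:      etcdctl / fleetctl / kubecfg       (over SSH tunnels to the control node)
                |
Kubernetes:  apiserver, controller-manager, scheduler on the control node;
             kubelet + kube-proxy on every worker  (application scheduling, services)
fleet:       distributed systemd; global units, role=control / role=node metadata
flannel:     10.244.0.0/16 overlay so every container gets its own IP
etcd:        shared key/value store everything above coordinates through
CoreOS:      kernel + SSH + container runtime on each VM (1 control node, 5 workers)
```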
Regarding Kubernetes: with the four replicas, why did it choose to put two replicas on the same node? This is a great question. So his question was: when we did list pods, it looks like two of the replicas ended up landing on the same host — the host ending in .160.122 got two of the processes. And this is because the Kubernetes scheduler is really unsophisticated; it just kind of makes a best effort to spread the load around. There's an integration with the Mesos scheduler if you want a really sophisticated bin packing scheduler, and they're working on that. And — because essentially this is the worst possible scheduler — they're working on making it so that the default implementation in Go isn't the worst possible scheduler, but at least does some heuristics. But this is just the current state of it. Kubernetes is a pre-1.0 project; I think they're planning on it being closer to 1.0 in the next three to six months.

Right. So we have this hello service file. Again, we're going to define this proxy that load balances all these hello services that are now running in our infrastructure behind a single port 80 endpoint. So we need to do two things. First, we need to enable TCP port 80 in our firewall rules on our virtual machines. I should have done this in the beginning, but — okay, well, I forgot to delete the old rule, so hopefully that still works with our new VMs. And then we tell kubecfg, with this JSON file, that we need to create this service endpoint. And then what should happen is that I'm able to list these instances and I'm able to hit one of them on port 80. Yes, yes. And what's happened is that this service is now exposing, in a round robin manner, all of the instances. Now, that's not super interesting quite yet, because everyone's running the exact same version of the code, so no matter how many times I refresh, it's going to look identical — I'm always going to get the same response.

So let's do something slightly more complicated. What we're going to do is create another replication controller, called the canary replication controller. And what this is going to do — we're only going to have one of them, and it's going to be on a track called canary, so it's going to be slightly different. How canary tracks are used in a lot of infrastructures: some developer has said, I know 100% that my code is 100% correct and it won't break anything, and the operations people say, let's measure that. And so what you'll do is you'll take the new version of the code, which is 2.0, and you will add just a few of these to the environment and see how it behaves behind the load balancer — whether its performance is correct, whether it's still handling the same number of requests per second that the old version of the code was handling, et cetera. So it's essentially a way of tagging things — like the canary in a coal mine — that you'd like to test out. So we'll take this new replication controller, which will run beside the old replication controller, and you'll notice that it has the two important labels that are in our service file — the name being hello and the environment being prod — so it'll be load balanced in with everybody else, but it'll have a different track, canary. So we'll list the pods, and we should see one of these here: this guy is running on the canary track and running the new version of the code, our 2.0 version of this hello world application. And so what we can do now is test out that this round robin service load balancer is actually a round robin service load balancer. So we'll do a while-true loop, and every two seconds we'll make a new request.
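The canary rollout and the check on it boil down to something like this; the file name, image tag, and node IP are placeholders.

```sh
# Same name/environment labels as the stable controller, different track, newer image.
kubecfg -c hello-canary-controller.json create replicationControllers
kubecfg list pods                              # four stable pods plus one canary pod

# Hit the service through any node and watch 1.0 and 2.0 responses get mixed in.
while true; do curl http://<any-node-external-ip>/; sleep 2; done
```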
And on the first request we get 1.0, the next request we get 1.0, the next request we get 1.0 — now I'm getting a little nervous — the next request we get 1.0... oh yes, okay, there's 2.0. So, since this is a super dumb round-robin load balancer, what you'll notice is that it's fully deterministic, and the next one will be 2.0, hopefully. Okay, well — yeah, live demos. There should be a 2.0 in there somewhere. Let me see if the thing's still running. Oh, there it is, it's on top. Well, in any case — oh, there we go. Okay, so I guess it's not as deterministic; maybe they fixed it so it's not deterministic, but it is still a round-robin load balancer.

Right. And so this service, because it used these labels and made the query "give me all the running processes with the name hello and environment prod and load balance them for me" — it made that query, it doesn't care that this particular process has the track of canary. You can imagine that if you wanted to kick out everybody who's trying to do canary experiments within your environment, you would specify track equals stable. Okay, great. So now that's all working, and we have this replication controller that is taking care of our canary track, and we can update these two things separately. This canary JSON file — we can update it with multiple replicas, we can change the version to 3.0, or to 2.0.1 because that bug was actually there in the code that the developer promised wasn't there — and we can roll out these two things as separate entities within the infrastructure and give different people, different organizations, control of them.

All right, the last thing that I'll demo within Kubernetes is a rolling update. As we all know, infrastructure is not static: after we deploy a piece of code, we are always planning on deploying the next piece of code at some point later. So I will leave this little curl thing running — my gosh, there we go, okay — and we'll kick off a new thing called the rolling update. What rolling update does is take an existing replication controller, in this case our hello stable replication controller, and — I believe every minute — go through and change one of the instances within that replication controller, in the stable track, from 1.0 to 2.0. So we have this command line that says: update the image of this replication controller, this hello stable controller, from whatever it is right now to 2.0, and use the rolling-update algorithm. This is simply a 40-line Go while loop that every minute makes a single API call to the Kubernetes API to slowly update individual copies inside the replication controller to the new version. So you can imagine implementing any number of algorithms here: you could hook it up so that a product manager can send an email that updates a single copy, or you can do it via IRC, et cetera.

Yeah — the question was about the connection failures, whether there are health checks involved. Yeah, it's probably because, as I start my HTTP connection, it's nuking the process and then I don't get an HTTP response. And this is because the service endpoint isn't really application aware. There's code being landed in Kubernetes right now so that you can back services with nginx or any application-aware load balancer. So you can imagine having a MySQL-aware load balancer, or nginx as an HTTP-aware load balancer. Essentially, as you move further up the stack and you want guarantees on requests returning from some sort of request-response protocol, you need a smarter proxy than just a TCP proxy.
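Going back to that rolling update for a second: the invocation was roughly along these lines — I'm reconstructing the flags from memory for this pre-1.0 kubecfg, and the image name is just a placeholder, so treat this as a sketch rather than the exact command:

    # roll the hello-stable replication controller to a new image,
    # touching one replica per update period (-u); exact flags may differ by release
    kubecfg -image example/hello:2.0.0 -u 1m rollingupdate hello-stable-controller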
Right. So over time, over the next couple of minutes, this rolling update will change all of the instances from 1.0 to 2.0, and we can actually watch that happening if we run this — which is watch, and then kubecfg list pods — and we'll see that, yeah, four of the instances have been updated to 2.0 and there's only one instance left running 1.0. In the next minute the last guy will be nuked and then we'll have fully upgraded our infrastructure to the latest version. Again, like I said, this is just a really simple example that uses the API. You can imagine something that updates one instance, then checks health monitoring, then waits for any tweets about your website being down, then requires a product manager to send an email saying everything still looks good before rolling out the next one. These are all workflows that we end up implementing anyway, but because the Kubernetes API is focused on this idea of having a consistent piece of code that helps you roll out the update, it's actually fairly trivial to do this more complex, workflowy, however-your-company-does-stuff sort of thing.

All right, so that officially ends all of the demos. Woohoo, we're running 2.0 everywhere. So I'm happy to talk about and dive into any of the moving parts further, and take random questions about CoreOS or containers or anything you'd like, whatever's top of mind. And I'd like to point out that you should be able to replicate all of this: it was all started out of a clean checkout of this repo, which includes all the JSON files and YAML files, et cetera, and the instructions that I essentially just copied and pasted, except for the typos I introduced all over the place. So if you want to try this at home, all the resources are here, and you can hit me up on Twitter or email if you want to try it too.

Yes — so the question, while we've still got the example running from before where all of the pods were listed: it occurred to him that we started off with five of them running 1.0, and they were running on certain hosts and all that sort of stuff. Did we stop the 1.0 ones and start the 2.0 ones on the same host, or did we just stop them and have Kubernetes go, "oh, you said five, I've only got four," and spray them across again? And there's no guarantee — yeah, it makes a new scheduling decision. Cool, yeah. Yep. And the reason things are bouncing around is just because it doesn't order things coming out of the API.

Yes — oh, okay, sorry, yes, please. "I've got sort of two questions. For rolling upgrades, how easy is it to roll back if you find something starts throwing errors and you have to quickly roll back?" Yeah, well, we can try it out. Actually, I haven't tried to go backwards before, but it should work. If I remember correctly, the command line flag is minus u — nope, I think it's minus u — somewhere, one of these flags, minus u 10. Essentially you can use the exact same idea: replace the image with whatever your old image was and roll the update back to the old version. I'm pretty sure I got the command line flags wrong, but you get the idea — it's the same idea.
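If you're following along at home, watching the rollout and rolling it back both reuse the same mechanism — roughly like this, with the same caveat that the flags and image tags are from memory and just placeholders:

    # watch the replicas flip between versions as the update proceeds
    watch kubecfg list pods

    # rolling back is the same operation with the old image as the target
    kubecfg -image example/hello:1.0.0 -u 1m rollingupdate hello-stable-controller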
"And the other thing I was going to ask: do you see any sort of conflict between fleet and Kubernetes, and are there any plans to take the really cool features of Kubernetes and merge them into fleet, so you can sort of cut out that part of it?" Yeah, it's something we've definitely thought about — whoo, see, the 1.0s are coming back on that. So, fleet: part of the problem is just that the user interface of fleet is designed to be systemd, and systemd doesn't handle service discovery, it doesn't have the idea of replication controllers. The user interface design we've made around fleet is that it is distributed systemd, and it's a really useful abstraction. It helps people take things that are working well — and the hardest part of scaling a distributed system is always going from one to two, because that's where everything changes — so we thought it's useful to have that abstraction. Kubernetes is doing its own thing, and I think that's okay. We already have two schedulers within CoreOS: the locksmith scheduler, which handles rolling out updates within an infrastructure, and the process scheduler, which is fleet. So I think, as people start to think in distributed systems this way, a proliferation of schedulers is okay.

Yes — "As a database geek, one of the primary things I haven't been able to quite wrap my head around with Kubernetes, and now being introduced to fleet, is how you would manage a group of machines with a single master — say for a database, where which machine is the single master would change." Yeah, so this is something that's been an ongoing discussion within Kubernetes, and there are a few pieces of the puzzle that need to be solved. The first is that the Kubernetes scheduler needs to be aware of the fact that moving databases is expensive. So there are designs going on about hinting the scheduler API to say: I'm upgrading from Postgres 9.1 to 9.2, and I really, really, really would appreciate it if that 9.2 landed on the same host where 9.1 was running, because the WAL files are hot there, and everything is hot on that box. So that's the first piece — just being able to update the software, et cetera. The next piece is how you actually do the replication, and I think a lot of the existing Postgres or MySQL tooling can be integrated well, because a lot of it is a master-and-slave replication sort of setup. So you can imagine that writing a leader-follower replication that is aware of the Kubernetes scheduler would be a fairly powerful abstraction to build on.

"Yeah, I guess where I'm going with this, in terms of the Kubernetes configuration specifically, is that you do your initial setup and you've got one machine in the master role with a specific virtual address via Kubernetes, and then, say, four machines in the replica role — and that's important, because you have application traffic that specifically only goes to the replicas, right? Then you lose the master virtual machine, and you want to promote one of those replicas to now being the master, and I haven't been able to figure out anywhere where I can actually change the role of an already running machine." Right. So, Kubernetes right now — there has also been talk about having an API for doing a leader election inside of Kubernetes, but right now there's no API for that. Essentially, what you'd have to do is what we do, and what Kubernetes itself actually does: use etcd to do the leader election, because etcd is designed to do leader elections.
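To give a rough idea of the primitive involved — the key path and TTL here are made up for illustration, this is just a sketch of the pattern — the election with etcd looks something like this:

    # try to become the leader: mk only succeeds if the key doesn't exist yet
    etcdctl mk /services/postgres/leader $(hostname) --ttl 30

    # the current leader keeps refreshing the key before the TTL expires;
    # if it dies, the key expires and another replica's mk will succeed
    etcdctl set /services/postgres/leader $(hostname) --ttl 30 --swap-with-value $(hostname)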
And so the Postgres pod or container would have a little agent running inside of it that is etcd aware, constantly trying to do a leader election, and once it acquired that distributed lock it would talk to the local Postgres port and say: you're now the master of this cluster. "Is there any way to extend that leader election process? Because there are obviously things inside Postgres that determine which one should be the master." My understanding was — I could be completely wrong — that Postgres didn't have any sort of consensus or leader election algorithm internally; it relied on something external. "Yeah, but the issue is that some of the replicas can be further ahead than the others." Oh, yeah — you want the furthest-ahead replica to be the new master. So what would happen there is you'd also have the Postgres replicas periodically either cross-chat and gossip where they are, or register themselves into etcd — like, "I'm at WAL position x, y," or whatever. So we can talk further about the design. We prototyped a really rough version of this — Postgres master election using etcd — but I think it would be helpful to have a discussion about how to do it correctly.

I've got the microphone over here — oh, yes. "It was a bit alarming to see the outage happening there. If I was running services where outages are not allowed, would I need to use that locksmith thing, or how would I avoid that? I want to do updates without getting an outage, so how would I do that?" Yeah, so the way you do it is you have to have a load balancer that's aware of the protocol. If this had been nginx or something, you wouldn't have seen that, because nginx wouldn't have returned, from a reverse proxy, a request that was terminated halfway through — it's aware that an HTTP request needs to be completed all the way through the full content length. "Okay, so I need to have something, perhaps like HAProxy, that really is doing the verifying of the services." Yes, and people have prototyped that; I just didn't demo it here. There are prototypes of using the Kubernetes API with HAProxy or nginx — something that's protocol aware. That's what you have to do.

"Just for posterity, in the node.yaml file that you covered earlier — for when I come back and watch this later — you had a flannel definition section in there, and you were saying that's not quite how you would do it, maybe tomorrow." Yeah, I'll come back to this in the future. "So does that just become like the other declarations, for etcd and fleet?" Yeah, exactly. This is the doc that we just wrote, if you want the historical background of why we had to do a lot of work to make this work correctly — because we essentially want to run flannel in a container, but we need to configure Docker, which is our container engine, so we end up with two Docker engines running anyway. It's a long story — really interesting, but a long story. But instead you can just say: start flannel, and here's the network — which removes like 100 lines of boilerplate, and it actually uses a container instead of wgetting things from the internet.
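For reference, the network definition itself lives in etcd — flannel watches a well-known key for its overlay configuration — so seeding it looks roughly like this (the subnet here is only an example value, not what I used):

    # flanneld reads its overlay network config from this etcd key
    etcdctl set /coreos.com/network/config '{ "Network": "10.1.0.0/16" }'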
"Back to the previous question about HAProxy and nginx as the load balancer: what's the actual preferred use case — when would you want to use Kubernetes' built-in proxy?" Oh, I missed the last three words. "What is the ideal use case for the built-in proxy service inside Kubernetes, rather than going to HAProxy or nginx?" Yeah, the primary use case is simplicity — if you don't want to build something harder. For a lot of things this works fine: memcache, Redis, things where your driver will actually be aware that it didn't get the full response it was expecting. It just depends. Essentially, there are a lot of use cases, and it's really simple and it's built into the thing, so you don't have to think about it. You could also imagine internal APIs — if you don't care that you might get a truncated response and you can handle that. A web browser won't handle that very gracefully, but internal things will be aware that they didn't get the full content, give you an error, and you handle that inside your code as a regular matter of course. Five minutes — okay, yes.

"Google has a very sophisticated scheduler in Borg, which it's using to run things at huge scale. Why is it that a lot of the primitives coming into Kubernetes are so much simpler, and a lot of these decisions don't seem to be ironed out, when you'd expect, after the last several years, that they've actually worked it out internally? Is it just completely separate teams?" So, a lot of the people you see working on Kubernetes are the teams who worked on Borg and Chubby inside of Google. The reason is that writing the scheduler is actually the easy part compared to everything else, when you think about it. What we're trying to do is get consensus around APIs, and that's the most impossible problem — it goes back to arithmetic being impossible, right? Once you define an API, you're going to have third-party application writers using that API, you're going to have tooling like kubecfg or kubectl, you're going to have dashboards, et cetera. So the first thing you have to do is get consensus around that API; then making a really efficient scheduler is easy, because it's on the other side of that API. Given the small team that's working on it, they wanted to nail the API, nail the interactions, get all the use cases of load balancing and replication correct first, and then you can go through and take advantage of all the infrastructure you've laid out to do efficient bin packing and utilization of the resources. The other reason is that there are already really good schedulers out there, like Mesos, and you can imagine replacing the dumb scheduler I used with an existing open source scheduler like Mesos — because people do use the Mesos scheduler in infrastructure today and get really good utilization, 60 to 70 percent of the CPU, RAM, and disk they buy from their hardware vendor. So that's the primary reason. They're an extremely good team; it's just not the nut of the problem. Yes.

"I have a question about the application side. If we use round robin with a stateful application — because this demo was about a stateless application, right? What about a stateful application — say we use JMS, or session management, or other stuff — how do we use this model for scaling, creating multiple pods, and how is it scaled and managed at the application level?" Yeah, it goes back to the question around Postgres. Essentially, there are two parts to the answer. One is that the Kubernetes guys are designing
as well as they can for existing data stores like Postgres and MySQL and these leader election patterns. The other answer is that in a lot of cases we buy infrastructure that already handles the stateful bits of our applications pretty well — from cloud vendors you're buying a SQL database, a large hosted database that holds all of the application state. I guess there's a third answer, which is that people are working on more sophisticated databases designed for this sort of thing — ones that expect host failures as the normal case and don't have a strong leader. So there are things like Cassandra or Riak that are designed this way, where you give up some of the consistency of the data store but you get the freedom that an individual host going away isn't necessarily a data loss situation. And then there are also people taking inspiration from Google's white papers around Spanner and some of their other database products, where you want to mix: you want to say, I need hard consistency with hard replication for these things, because they have to do with money or are very sensitive bits of information, but this other bit of information is a cache, or a replicated piece of data for the region I'm in, and I care less that it's safely stored anywhere. This idea of having tunable consistency and tunable replication is something we're seeing in new data stores — one that I'm particularly excited about, called CockroachDB, is using our Raft implementation from etcd and taking inspiration from the Spanner paper to implement a key-value store with tunable consistency and replication. So it's a mix of answers, and none of them are that great, to be perfectly fair, but I think everyone is really focused on figuring this problem out.

"For etcd — do we have any API access from the application to the etcd durable store? Like, can we get information from the API about which machine serviced the request?" I think I'll have to talk to you offline on that one, because we just ran out of time — but etcd does return a bunch of metadata about which etcd member serviced the request and that sort of thing, which may or may not be what you're asking about, so we can talk about it.

All right, well, I want to thank everybody. Check out the GitHub page if you need anything, remember my name is Kelsey Hightower, and I really appreciate the time you all took to come out. So thank you.