behind it and the questions of what state is in the context of orchestration in the new stack. I am not going to be explaining the basics of Kubernetes: there was a talk yesterday by Michael, and I gave a talk on Thursday on the basics of Kubernetes. So if you missed both of those, fake it. Oh, and before we get this out of the way: StatefulSet was actually under the working name PetSet until Kubernetes 1.5 was released two months ago. So if you see the name PetSet in places, including in my code, frankly, it's the same thing. Oh, speaking of my code, some of the code examples that I have here are in that Git repo, and I will be continuing to add other database configs there. So, a couple of years ago, three years ago now, something like that, I started messing around with this thing called Docker. Frankly, mostly for training stuff, but I was like, hey, this is really cool. This would be a really good way for me to package up PostgreSQL so I don't have to go through all of the configure-and-install-the-dependencies and everything else. It would get me more towards my personal goal of fully automated PostgreSQL, because right now deploying databases in production is way too dependent on expert staff, even for simple cases. And there just aren't enough expert staff to go around. Simple cases should support automated solutions. But going to the first DockerCon, I actually got kind of disappointed, because everybody was talking about stateless applications. Here, stateless applications: when you do this, everything should be stateless. As a matter of fact, somebody even went out and made a philosophy of it called the 12-factor app, and I think two of the points of the 12-factor app say that all of your applications should be stateless. Well, we do have data, so where are we putting that? Well, this one says that any data needs to be stored in a stateful backing service. What does stateful backing service mean exactly? We know what that means; this is what that means. Well, Amazon RDS is pretty good and a lot of people use it and that sort of thing, but I'm a system builder. I don't want a philosophy of design that requires me to use somebody else's proprietary service; I want to build the service myself. It was a little disappointing. So I went back and took another look at the ecosystem, and it was just really all over the place. Looking at sort of our new stack and new orchestration, it really felt like we'd built half a house starting at the top, and had no idea how we were going to do all of the important things underneath, like the foundation and plumbing and electrical conduit and everything else that you need to make it actually a house. Because, you know, I'm sure Facebook would be lots of fun without any MySQL databases. Wikipedia doesn't really need data. So then I started to think: okay, I understand that they want to do stateless applications, because they're easy, because it's obvious how to orchestrate them, because you can scale them up and down very smoothly and that sort of thing. But I'm interested in stateful applications. So it actually turned out, after talking about this for a while, that the first thing we had to figure out, to define, was what state is in the first place. We've got a stateless application, so what is a stateful application, and what is the difference between them?
And are we going to throw out a possible definition here? Ideas, someone? Wow, no ideas at all. Nobody even read my opensource.com article. Yeah, there's some kind of data; yeah, that's one definition. Somebody got another one? Okay, no, that's more like the template or the configuration for the application. Data exists independent of the process that created it: I kind of like that one, that's getting closer. The definition we actually came up with, to keep it as simple as possible and to cover all of the different cases of state, was the difference between the source code and the running application. That is, in the case of a container infrastructure, the difference between your container image and a running instance of that container image is state. And by that definition, there aren't any actual stateless applications. There are applications that have minimal state. For example, even if you're running a RESTful call function on a web server that stores no data for a piece of code and that sort of thing, you still, for however many milliseconds, have a current task you're executing, you have a little bit of memory cache, and you have your location and/or number in the cluster deployment. That is actually a form of state. Now, it is very small and very easily replaced state, so we tend not to care about it and call the application stateless. But when you actually look at state this way, it's not so much that there are stateless applications; what we have is a scale, where we have applications that have less state and applications that have more state, right? Applications that have less state would be sort of RESTful functional programming; AWS Lambda is on that end, in terms of least state. Going up through applications that handle data but don't store it, things like CDNs that just store files but don't have other stateful functions, up to, for example, stream processors that do have data and running configuration but don't store anything long term. And then of course the things that have the very most state are transactional databases. Transactional databases have all of the kinds of state that there are. Another way you can look at this is in terms of switching cost. That is, if I have a running instance of this application and I have to shut it down and move it, how much of a cost is that going to impose on my infrastructure? Down at the least-state end, we're talking about sub-seconds of lost availability for that particular instance. Often you'll have other instances that can take up the load, but even if you had only one instance, you would have however many milliseconds it took you to move it in terms of lost availability, and no data loss, because there's no data being stored at all. Whereas on the fully stateful side, if you lose an instance unexpectedly, you're going to have seconds to minutes of lost availability, as well as, in an unexpected loss of an instance, some data loss. That's your switching cost, and the higher we go on switching cost, the more state we're talking about. So I did mention that we've got more stateful and less stateful, and I said transactional databases have all of the kinds of state that there are.
And there are actually several different kinds of state that we care about, and a lot of this came up in the design of StatefulSets for Kubernetes, when we started talking about what kinds of state there are and what kinds of state we need to support. Now, the obvious one, and the one that everybody thinks of first, is storage, right? Persistent storage of files, data, information, configurations that change at runtime in the case of, say, networking stuff, all of that. So storage is the obvious one. I want to get some guesses for the other three. Memory cache? That was actually one that we had listed initially, but we decided that while memory cache is a type of state, it's not one that we can actually preserve, so that kind of dropped off the design, yeah? Okay, availability: can you get to it from the location that you choose to? Yeah, that's actually very close to one of the ones that we have. Anyone else? Yeah, I/O retrieval. Any more? Configuration, your role in the cluster? Oh, that was very good. What? The state of the application that's running. Yeah. So, these are the four that we came up with. This guy got it on the nose with cluster role. A couple of you were coming very close to session state, and then the other one that nobody mentioned is node identity, and I'll show you what we mean by that. Now, there are a couple of other things that are state, like memory cache, that did not become part of the design of StatefulSet, because there simply is no way for us to preserve memory cache if we're going to have an instance that disappears off of one physical node and appears on another one. Maybe with future technology that might actually be possible, but right now it's not. So these are the four that we looked at tackling here, and I'll explain what these four are. So, first one: storage. And I've got Giant-Man here because our storage needs to be elastic, right? That's the obvious version of state. Then we start looking at what we actually need for stateful applications in terms of storage. It obviously needs to persist in some way. It needs to be able to survive certain kinds of failures, for some application-side definition of certain kinds of failures, and that can be different. But two of the other requirements we realized are that we actually want to be able to move it around with the container, with the application. I want to say: hey, if this is the persistent storage for PostgreSQL node number six or Couch node number 111, then if we have to move that Couch node, we want its storage to actually be movable with it. The second thing we realized, mostly because of issues with people trying to work around this with Kubernetes persistent volumes, is that we need a way to associate exclusive write access with particular nodes. Because, for example, if you have two transactional databases and they are both writing to the same directory, what happens? We have a few database people here: what happens if you have two PostgreSQLs or two MySQLs writing to the same files? Yeah, yes. So we need to be able to move that exclusive-write-access bit around with the container that is supposed to have exclusive write access to that copy of the storage.
So one of the first workarounds that anybody tried with Kubernetes, and you'll still see some templates and that sort of thing out there, is to have the initialization code for the container create a subdirectory of the shared storage, named after the individual container, and thus avoid a conflict. There are a lot of ways that can go wrong, which is why it was never really a satisfactory solution, but you'll still see it. And currently it's the only solution, I believe, that's available under other orchestration systems. I'm not that familiar with Mesos, so I might be wrong there. Then the second thing is that people designed some specialty storage plugins. In particular, there's this thing from a company called ClusterHQ, which is called Flocker. Flocker actually did have a way to associate storage with particular containers and have it follow them around. Unfortunately, what it didn't do was plug into any orchestration system, so it happened completely on the side of the orchestration system, which made it difficult to make it actually work in there. So what the folks working on StatefulSet actually came up with was this idea of dynamic network storage (and currently it's only network storage, that's a current limitation) that gets allocated per container at deployment time, or in Kubernetes terms per pod, since Kubernetes wraps one or more containers in something called a pod, via something called persistent volume templates. The idea of a persistent volume template is that you create a definition not of a particular storage location, but of what that storage should look like. So for example, if you're running on AWS, you would say: we're going to allocate an EBS volume based on this criteria; it's going to be 20 gigabytes in size, it's going to be in this region and zone, and here are the application credentials so you can actually create it. And then when you actually deploy a new pod, it goes ahead and creates that for you. As you'll gather from this description, that's only going to work with certain kinds of storage, kinds of storage that can support some form of dynamic allocation. But the advantage is that the storage is associated, by name, with the pod. And then if the pod, which is the container, needs to be moved or replaced, the first thing Kubernetes will try to do is reattach its storage; it will only reallocate that storage if the storage is gone for some reason. Which allows you to say: hey, I've lost this node, this machine, or in Amazon AWS terms this instance, and it had three Postgres instances running on it. We bring it back up, and those three Postgres instances are going to try to initialize from their original data directories if they can. So here's a bit of example code. I'm actually going to show this to you in the terminal. Ooh, assuming I can get back into the terminal. There we go. So for those of you who have seen Kubernetes definitions before: the first thing I do is define my general storage volume, which in this case is going to be a 300 gigabyte EBS drive. And then, I just realized it's the wrong one, because the ephemeral example does not have persistent volume storage; that's the whole point. And then you create this volume claim template that allocates chunks out of that storage.
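For reference, here is a minimal sketch of that pattern in a StatefulSet definition. This is not the exact file from the demo repo; it assumes Kubernetes-1.5-era API versions and a dynamically provisioned EBS-backed storage class, and the names and sizes are illustrative.

```yaml
# Illustrative StorageClass for dynamically provisioned EBS volumes
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
# Inside a StatefulSet spec: one claim per pod is stamped out from this template
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: patroni
spec:
  serviceName: patroni
  replicas: 3
  template:
    metadata:
      labels:
        app: patroni
    spec:
      containers:
      - name: patroni
        image: example/patroni:latest        # hypothetical image name
        volumeMounts:
        - name: pgdata
          mountPath: /home/postgres/pgdata
  volumeClaimTemplates:
  - metadata:
      name: pgdata
      annotations:
        volume.beta.kubernetes.io/storage-class: ebs-gp2
    spec:
      accessModes: [ "ReadWriteOnce" ]       # one writer per volume, per pod
      resources:
        requests:
          storage: 20Gi
```

Each pod stamped out by the StatefulSet gets its own claim from this template (for example pgdata-patroni-0), and that named claim is what Kubernetes tries to reattach if the pod has to be replaced.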
So Patroni is a system for automated high availability for single-master Postgres, and if we get time I'll actually demo deploying it in a minute. So that's our main storage concept. And if you go ahead and allocate this (and this actually takes a while, which is why I'm not demoing the allocation), then you get this set of bound storage volumes that don't go away if the container goes away. These are tied to individual nodes in a clustered Postgres configuration, in this example using Patroni, right? So here, in our initial deployment, we would have a master and two replicas, and each one of them gets its own storage allocated out of this dynamic storage. Now, there are some limitations to the current StatefulSet volume templates. First of all, right now you don't have the option of using local storage, which is unsatisfactory to a lot of database people, because we might want to use the faster local storage and rely on database replication to replace the data if we lose a machine. Second, the recovery logic is up to the application. By that I mean, for example, if you lose a Postgres node under catastrophic circumstances, it's possible that the files for that Postgres node are corrupted, and if they are corrupted, you want to stop trying to recover that node from that storage, give it up, and have it kick off its initialization again. Right now it's up to you to write that recovery logic. And then the last thing is garbage collection. The advantage of these volumes not going away when the container goes away is that you can potentially recover your data from these bound volumes, right? The disadvantage is that if the container went away on purpose, for example if you had 11 Postgres nodes and you decided that you really only needed six to support your workload, those extra five volumes don't go away automatically; an administrator has to delete them. Which means that if you're allowing users to self-service database deployments, then you're going to need some kind of garbage collection job that goes through and gets rid of volumes that are no longer being used. So let's talk about our second bit: identity. This is one that a lot of people didn't think about at all until we started trying to implement it, and it actually became one of the most critical parts of the StatefulSet design. To explain the identity concept, I'm going to use a couple more superheroes. This is Multiple Man, sometimes hero, sometimes villain. Does anybody know Multiple Man? This is from the X-Men. So basically Multiple Man can make multiple copies of himself, right? Which is a pretty kick-ass superpower if you think about it. I mean, he doesn't have any other superpowers, so all he does is make a whole bunch of himself and get into fist fights. But if you want to do something like, oh I don't know, staff a presidential political rally, you can see where you could pick up a lot of money. Anyway, the thing is that Multiple Man is a lot like conventional Docker container orchestration, right? You have your container image, which is the original Multiple Man, and then you have multiple multiple men. But these multiple men have no identity of their own. They are just copies of Multiple Man. They don't have individual names; you can't address them except as "this one over here," and if that one moves you can no longer tell which one it was.
And that is how we conventionally deploy things in container orchestration, and how we should deploy stateless applications in container orchestration, right? Since the amount of state in these low-state applications is not anything we care about, we shouldn't care about the identity of individual containers; we only care how many there are. However, stateful applications end up being a little bit more like Spider-Man. There have been a number of different Spider-Men, right? We've got the original Spider-Man, and we have Miles, and we have Venom, and we have Spider-Gwen, and we have Spider-Ham. These are all Spider-Men, but they're not identical. They have a distinct identity, right? Miles is not the same as Peter Parker, and certainly not the same as Spider-Gwen or Spider-Ham. So you don't want to mix them up, and you can identify them individually. And this is what we want for stateful applications. Now, on a practical basis, the reason why we need this identity is things like: for a lot of multi-master database systems, and multi-master non-database systems like CDNs, we need to tell each node who its peers are. And we can only tell it who its peers are if the peers are identifiable by name, and by address for that matter. For PostgreSQL streaming replication, there's this thing called replication slots, where each replica has to allocate, by name, a bookmark in the master that says where it is in the replication stream. Plus, in a lot of cases in a real replication system for a stateful application, you're going to have nodes that have a special purpose, but you still want them to be part of the same set. For example, for a lot of multi-master systems (I know this is true for Cassandra) the first node has to be the initialization node. One of the common things you do in PostgreSQL and other relational databases all the time is that certain replicas are only for running big reports, because those big reports pretty much make them useless for anything else. And in a sharded database, you may have shards and shadow shards, shards that are just there as replacement shards. So in all of these cases, you need to be able to make distinctions between the individual running containers. So where this came in is, again, four attributes for identity. One is that identity needs to be individual: if I say I want Patroni-5, I should get only one container, not a random one of three different containers. It needs to be durable, in that that label should not get reallocated arbitrarily as the network shifts around. It also needs to be predictable. This was one that didn't come out in the first cut, and then we realized in the second cut that it was absolutely necessary, because for a lot of stateful systems things need to come up in a certain order, and I'll show you why that is. And then, of course, it needs to be addressable, right? I need to be able to make a connection to etcd-3 by that name, or at a specific URL that I can derive from knowing that there's an etcd-3. So we have all of those things in StatefulSet now. This is an example that actually brings stuff up, but let me actually show it to you instead, since I have working internet this time. So this is the definition for a simple three-node etcd cluster.
etcd is what's known as a distributed consensus store; it's a way to have information that will survive the failure of individual nodes with absolute consistency. Very useful for storing metadata. In my case, it's storing the information about who is the current master in a Patroni cluster, so I need to bring this up first. There are actually two parts to the stateful set. The first part is that you create what's known as a headless service. A headless service is one that isn't bound to a particular container yet. It's a special service, actually, and here's one of the things that's not all that clear in the documentation, which I need to fix: this is also a service that you can't connect to in the normal way. The service does not have an IP address in the internal Kubernetes cloud, because it's a special StatefulSet service. And then this is just giving it a password. And then we actually create the StatefulSet with replicas, and the rest of this is all setup. But let me show you what that looks like. You can see here what happens: these individual pods come up, they're all zero-indexed, and they all come up in order. etcd-1 will not come up until etcd-0 comes up. So there we go. Ooh, now that's interesting. Okay, let's see how long it takes to replace etcd-0. Longer than I want to wait. I did point out that this is still a beta feature; I probably have a bad node and a bad virtual machine in this cluster. Well, we'll get back to that later, because I don't have a bunch of time to spend troubleshooting while you watch. But they come up in order, and that actually delivers us two things. Number one is that the individual nodes have identity, and that identity is addressable: we can specifically route traffic to individual ones by name. Now, the reason why we need that for etcd, and I can show you this in the definition even if the deployment is getting wonky, is that when you bring up an etcd cluster, you have to tell it who its peers are. In the sample definition I'm doing this in a fairly hackish way, where you would have to modify the definition for the number of nodes that you had. There are other examples out there that use a proper loop so you don't have to hard-code this, but this one is relatively easy to read: you can say, hey, I know that I'm going to have a zero, one, and two, and therefore I'm going to tell them all about the other peers, zero, one, and two, and so you have this peer list. If you want an example of how much of an advance this is for Kubernetes, you can look at CoreOS's original definition of how to bring up etcd. Their original YAML file is about five pages long, because of the screwy necessity of bootstrapping this and guessing what the addresses of the other nodes are. Now, the other thing that I said is that this is addressable. Until I troubleshoot etcd, I'm not going to be able to bring up the Patroni cluster, but I do have another cluster up here, and that is a Citus Data cluster. Actually, etcd should still be working, because we have two of the three nodes, right? So we're going to go ahead and kick that off. But I have a Citus cluster here, and one of the things that I can do is connect to individual nodes by name, so we have the addressability component.
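A minimal sketch of that two-part definition, the headless service plus the StatefulSet with the hard-coded peer list, assuming Kubernetes-1.5-era API versions and illustrative names, and omitting several etcd flags for brevity:

```yaml
# Headless service: no cluster IP, so each pod gets a stable DNS name
# such as etcd-0.etcd, etcd-1.etcd, etcd-2.etcd
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  clusterIP: None
  selector:
    app: etcd
  ports:
  - name: peer
    port: 2380
  - name: client
    port: 2379
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd          # ties the ordered pod names to the headless service
  replicas: 3
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.1.0    # illustrative image and tag
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        command:
        - etcd
        - --name=$(POD_NAME)
        # The hackish, hard-coded peer list: easy to read, but it has to be
        # edited if you change the number of replicas.
        - --initial-cluster=etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380
```

Because the service is headless (clusterIP: None), each pod gets a stable, ordered DNS name like etcd-0.etcd, which is what makes a hard-coded peer list like this possible at all.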
So if I need to connect to a particular shard, instead of connecting through the query node, I can do that. Oh, there we go. For any of you who are familiar with Citus Data, these are all shard tables with sharded portions of data. This is actually about a day of Wikipedia edits, which, scarily enough, adds up to 16 gigabytes. So anyway, that is the node identity portion of statefulness. Now, there are a lot of other things still to address in node identity: replacing nodes from copies of other nodes, otherwise known as promotion, because if nodes 1 through 12 are shards and nodes 13 through 24 are shadow shards, I want to be able to use node 13 to replace node 1 if node 1 goes away, and that's not currently supported; some namespacing issues; and also support for stateful sets across federation. That is, if I'm federating my Kubernetes cluster out to multiple data centers, I actually want to have a single stateful set where, say, some of my replicas or some of my shards are located in a different data center. That's currently a to-do, not done yet. So let's talk about our third kind of state, which somebody spotted earlier, and that is cluster role. Again, we've got some superheroes here: this is the Avengers as of the movie, and they all have their individual roles in the group. And some of those roles change, because they switch off, right? Sometimes Iron Man is leading the group, sometimes it's Captain America, sometimes both of those guys are drunk and Black Widow is leading the group. So we need to be able to identify what the current role of an individual node is, and possibly change it. We're talking about things like replication master, shard number 11, storage bucket number something in the case of CDNs or other file storage, bootstrap node in the case of a lot of these multi-master systems, and that sort of thing. The difference between cluster role and identity is that identity doesn't change, but cluster role does change. Some cluster roles are exclusive, some are not, and for the exclusive ones, in a distributed system, we're generally going to need support for leader-election-type functionality. So this doesn't get solved purely in StatefulSet, actually, because it can't be. Again, we've had some discussions about this. This is where we actually need to delve into the application and make the application realize that it's part of an orchestration system, because one of the other differences between cluster role and identity is that identity is something Kubernetes can do generically, regardless of what service it is we're supporting. Cluster role is very specific to the individual application, right? In single-master Postgres, we have a master and multiple replicas. Take a sharded database like Cassandra, and we're going to have a bunch of shard nodes; or Citus, where we're going to have a bunch of individual shards that are numbered. If we have a CDN, we're going to have a bunch of individual storage buckets, or paths by URL, that are defined that way. So it's going to be very application-specific; we can't really do it generically within Kubernetes. The application needs to participate in this. Now, one of the things that we can do is make use of distributed consensus stores, which are, as far as I'm concerned, the greatest thing to come out of the new stack, because they solve so many distributed-systems problems.
Because that allows you to share a configuration, including updates to that configuration, and it's going to be consistent. The various DCSes support leader election, and you can use annotations to indicate things like cluster roles. Now, by the way, in case you don't know: I've been saying DCS, but what I'm talking about here is etcd, Consul, ZooKeeper, embedded Raft libraries, a bunch of other things. You've probably seen some of these tools, and it doesn't really matter, for purposes of this discussion, which one you're using. Now, people have been doing some different things here in terms of stateful services. One is that you have an external DCS that you depend on, and the other is that you actually embed the consensus in the application. That's one of those trade-offs where nobody can tell you which one is better, because there are advantages and disadvantages going each way. Now, the other part of this needs to happen inside the application. It's something that I call bot programming, and something that the Joyent folks call the autopilot pattern, and it has a couple of other names. The basic idea is that the individual containerized application needs to automatically manage its cluster role using simple state machine logic. For example, in single-master Postgres, it boots up and asks: what cluster am I part of? I'm part of cluster two. Is there already a master in cluster two? No; then I'll try to become the master. If there is a master in cluster two, I will try to join replication. A simple state machine. And this has to be automatic and autonomous: each one of the individual containers needs to be able to do the right thing independently. This involves some hard thinking, but actually not all that much code, fortunately. So this is a snippet from the etcd information maintained by Patroni. But, ooh, Patroni-0 is having a problem. Don't know what it is; probably whatever problem the etcd node is having. My guess is I have a bad AWS instance, but it's bad in some subtle way that doesn't cause it to stop responding to Kubernetes commands. But you can see two of these are allocated, and this actually shows you what happens in the autopilot pattern, right? Patroni-0 came up first; that should have been the master, but apparently it failed. So when Patroni-1 came up, it said: is there a master in cluster one? There's no master in cluster one, so it elected itself the master. And then when Patroni-2 came up, it became a replica, because there was already a master. We can do the same thing with sharded applications. In this one, we have a query node and a bunch of shards, again assigning roles to themselves. In this case, I can assign the role based simply on predictable identity: if I'm node zero and I come up, then I'm the query node, and node one just has to detect whether or not there already is a query node and then become a shard, which is even simpler logic. But the idea is that this is self-determining, and it needs to be self-determining to the point of responding correctly to failures, as we have with Patroni here, where when the original master failed to come up correctly, we got a new master. I'm going to show you failover later, but I need to get moving on the rest of the presentation. So that's the general idea of cluster role, and the cluster role logic needs to go in your application container.
From my perspective, the best way to do that, if the application doesn't natively support this, is to wrap the application, whatever it is, in a governor program, and that becomes PID 1 of your container. Some applications, some of the distributed databases and that sort of thing, already kind of have this logic built in, although it's sometimes more or less effective, yeah? So the question was: can you inject this through the orchestration system and parameterize it? Yes, you can, but I actually recommend against that, because the problem with any centralized control of cluster role is that it's vulnerable to communications failures. And I know, from having developed a centralized Postgres high-availability system myself, all the different kinds of failures and all of the literally hundreds or thousands of lines of code designed to cope with those kinds of failures. It's just a lot easier to have the container do the right thing by itself. Now, there are some issues around this that still need work. One is that switching of cluster role is obviously asynchronous, because you're waiting for individual nodes to notice that the ecosystem has changed. You can get that asynchronicity down to very small amounts of time by using things like gossip protocols, but it will never go away. Which then results in some apparent inconsistency during a failure event; apparent, or real, inconsistency during a failure event. If you are running an asynchronous data system, like Postgres asynchronous replication or a lot of the multi-master databases, then there will be some data loss during an unexpected failure. And these depend on using a consensus protocol, which means that if you lose enough nodes out of your cluster, you can no longer form a consensus, and you do not have a service until an administrator intervenes. That's one of those things you just have to understand is an insoluble problem and be prepared for, because there isn't actually a solution; it comes down to CAP theorem stuff. And more importantly, it turns out that testing the state machine logic for these sorts of things is hard, and there aren't a lot of good testing frameworks out there for testing distributed applications. So that's an area that needs a lot of work. So, we still have time. We're going to talk about our last and least complete portion of dealing with state, and that is session state, which a couple of people mentioned, in different aspects of why it's a thing. The basic idea of session state is that, yes, there are application requests that are essentially stateless, or where the state is limited to a single response. You ask a question, you get a response, like a web server, right? An HTTP server: you ask a question, you get a response, at which point it forgets you ever existed. But that's not true of all kinds of requests. If you're downloading a large file, you actually have state in terms of where you are in that file. Video streaming. Database transactions: you have a database connection, you're in the middle of a transaction, and if you switch which database backend you're connected to in the middle of the transaction, you lose your transaction. A lot of authentication servers, like GSSAPI, and other servers also have a multi-request conversation where, if you lose it in the middle, you have to start over from scratch.
So this requires us to actually pay some attention to the state of individual sessions. Now, this is one where better solutions need to be developed. Right now, the thing that works out of the box with StatefulSet is to use Kubernetes discovery DNS. The way I connect to my stateful service is that I use the current service definition for whatever aspect of the service I need: I need to connect to the Postgres master, I need to connect to a query node, something like that. And because you're using DNS when you actually form the connection, the connection is formed to the individual pod and not to some kind of network proxy. So then it's just like you connected to a regular service that wasn't running as part of a container cloud. The drawback is that if a failure happens in the middle of your session and you have a network split, you could be on the losing side of the split and, until your session ends, not realize it. Which means data loss. The way to work around that is to write smart proxies that respond to changes in cluster role and application cluster definition. The drawback to that is that nobody's written these yet. I'm working with the Crunchy Data folks on writing one of these for single-master Postgres, but it's pretty far from done, and people need to write them for all of the other different applications: smart proxies that actually understand the cluster state. And then of course the proxy itself becomes another service that you have to have a definition for, that you have to load up, and that you have to manage and scale appropriately. But for right now, discovery DNS just works. Let me actually show you; well, actually, that was what I was showing you. Another way to do this is, for example, oops, except that database does not exist, so let's fix that. So here I get one of the replicas, or I can connect to the current master. But that's not all that useful, because what I really want is to follow the master. And the way that I follow it is by using labels. This was an interesting, undocumented Kubernetes feature that has now become very useful, discovered by Oleksii. One of the things that we can do in our cluster role code on the individual node is write a label, just a tag in Kubernetes, to our own pod. And then we can use that label to select an address through Kubernetes services. So what's happening is: when Patroni-1 booted up as the master, it wrote this master label to its own pod definition in Kubernetes, and then this service, every time I invoke it, will automatically route to whoever the current master is, as long as that label has been written (there's a sketch of this pattern below). Oops; it helps if I actually bring up the service. And there we actually have the master. That will actually end up working for a lot of people, but there are reasons why we also need to build smart proxies. So I think building smart proxies that work as part of the orchestration system is going to be a thing. It may eventually be a Kubernetes feature once we have a handle on it, if it ends up being something that is generic enough and not too application-specific: the smart proxy would be an object that brings up special application-specific code. But we'll see.
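A minimal sketch of the label trick just described, with illustrative names: the cluster-role code labels its own pod (the equivalent of kubectl label pod patroni-1 role=master --overwrite), and an ordinary Service selects on that label, so connections always land on whichever pod currently holds the master role.

```yaml
# Ordinary (non-headless) Service that routes to whichever Patroni pod
# currently carries the role=master label; names and port are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: patroni-master
spec:
  selector:
    app: patroni
    role: master      # written by the pod's own cluster-role logic after leader election
  ports:
  - port: 5432
    targetPort: 5432
```

On failover, the new master writes the label to its own pod and the old one loses it, so the Service follows the role without anyone having to edit the Service definition.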
So again, here are our four different kinds of state that we're addressing through stateful sets, and what I would say is the level of completion of solving those issues. Identity is almost completely solved, except for some special cases. Storage has some work left to do. Cluster role has a little bit of work left to do. And discovery DNS, or rather session state, is, I would say, only about half solved. So, since it's right before lunch, we can take five minutes for questions. Does anyone have questions? We have a mic here that we can pass around for questions. No? I answered everything you wanted to know about state. Okay. Here, wait, I'm going to ask somebody to volunteer here. Bruce, you want to be my Phil Donahue? Okay. You get to carry the question mic around. Okay, that's not what I meant with the tiara. Okay. Yeah, there's a question. Oh: for storage, is the only solution through Kubernetes, or are there other solutions besides that? Well, there may be other solutions besides that. I mean, Flocker is still out there, although ClusterHQ is not, so I don't know what the level of support for Flocker is going to be. Supposedly Docker is working on something for Docker Swarm, but that doesn't exist yet. Somebody was telling me about a third-party plug-in for Docker that did do something very similar to persistent volume templates, but I'm now not remembering what it was called. Yeah. Oh, REX-Ray. Yes, thank you. So if anybody saw it, we actually had a presentation on it on Thursday: this thing called REX-Ray that does that for Docker, and it's a Docker plug-in. Not supported by Docker Inc., and they'll probably break the API for it, because that's what they do, but right now it works completely. So you can use REX-Ray. I don't believe that there is anything for Mesos, because I was talking to a Mesos contributor just the other day and they were saying, we don't have anything like this. No? Oh, really? Okay. Oh yeah, here, take the mic, so I don't have to repeat what you're saying. Okay, so back in, I believe, October, REX-Ray was cut and pasted right into the Apache Mesos repository. So it ships with Mesos in a late enough version, and then DC/OS, which is an alternate form of Mesos, not only has REX-Ray but also deploys it to cluster nodes and health-monitors it. So it's even built into the UI, for monitoring the health of REX-Ray. Okay, so the answer to your question is yes, there are other things. By the way, do you want to give people your name, in case they want to ask you a question about REX-Ray? I'm Steve Wong with the {code} team; it's a group at Dell that's assigned to work on community open source projects. So I work on REX-Ray; I also work on Mesos and Kubernetes. Not much time left in the exhibit hall, but we have a booth there if you want to have follow-up questions on REX-Ray. Okay, other questions? No, nothing about identity? I can do, wait, we can do a failover demo. I have a question while you're doing that. Cool. So, just at a very high level, it sounds like you have two states: you have the storage state, and then you've got some kind of state in a proxy. Is that right, pretty much? Yeah, storage state and then state in a proxy. So, well, there's stuff being put into storage, and that storage is associated with individual pods. And then other forms of state, like cluster role, are basically being stored in labels.
And then identity is associated specifically with the naming scheme of the individual pods. So it was your... yeah, it seemed like there were two states: there's the state you're storing the data in, and then there's the state that coordinates everything together. Okay. So let's kill the master here and see; this will take a second to figure out that the master is gone, and then it will fail over and write the new label. Okay, more questions while we're waiting for that? Nobody? Hold on. Ah, here we go; it suddenly realized that it lost the master. Ta-da, okay. There we go. Wait, so when is it going to realize? Aha, there we go. Okay, it took a little bit longer than I expected, but we've got failover, and Patroni-1 got replaced and became a replica. That's an example of doing it for single-master failover. Okay, well, thank you very much. I'll be at the Red Hat booth for the hour that we have left in the hall if you think of other questions later. Thanks. Sure, I mean, it's online. Right, well, that's because, and I didn't actually find out, hold on. Yeah, I don't know why that pod isn't deploying; like I said, I think I actually have a bad AWS instance, but... So what is happening there? Yeah, I put it on for this time. Okay, so you're just going to put it over your ear, put this near your mouth, and clip this to your belt; the on and off switch is right here. Okay, all right. Anything else? I'm finished now. Okay, I've just got to run to another talk, so. Okay. Hello? Can you hear me? Yeah? Good. Okay, you hear me? Yeah? Good afternoon. Thanks for coming. So today I'm going to share some of our small tricks for using Docker as Ticketmaster moves into the public cloud. To show a little appreciation for you guys coming to my talk, we have a small raffle; we don't have too many people here. If you text this number, 404040, you may get a chance to win $100 of Ticketmaster ticket cash. You can use that money to come to our site and get tickets. I'll give you a couple of seconds to text or take a picture. Do it today; it expires today. A little bit about myself: my name is Andy Chan. I've been at Ticketmaster for about three years as a principal systems architect. Before joining Ticketmaster, I was a software developer; I did Java, I did Python, I did a couple of different kinds of languages. Right now I'm working in a team called the cloud enablement team; I'll talk about that a little bit later in the presentation. After this talk, if you have any questions about Ticketmaster or some of the tools we're using in this presentation, feel free to follow me or message me on Twitter or LinkedIn. So before I tell you those small tricks, I have to tell you some stories about Ticketmaster, so that you can understand why we took this approach. Ticketmaster was founded in the mid-70s in Arizona, in the US. In '96, Ticketmaster.com launched; if you guys remember, that's when the dot-com era started, and we are one of those dot-coms that still survives today. And then in 2010 we joined forces with Live Nation, so we can deliver the best live event experience. Today we are a publicly traded company and one of the top 10 e-commerce websites in the world. So you may wonder, what does the traffic look like for one of the top 10 e-commerce websites? On average, every 20 minutes a live event is happening somewhere in the world.
So you can think about how many events we've been selling at Ticketmaster. That translates into about 500 million tickets sold in 27 countries. So, enough business talk; let's get a little technical. For most e-commerce websites, like Amazon or Walmart or eBay, the peak time is Black Friday or Cyber Monday. For Apple, for example, their highest-traffic time may be when they launch their latest iPhone; it happens only at that particular moment. For Ticketmaster, every on-sale is our Black Friday. In other words, we have Black Friday almost every single day, maybe multiple times in a day. When that happens, there's a big spike coming into our network, and we have to handle that traffic. Another difference from other e-commerce sites is that, for example, Apple can sell as many iPhones as their factories can make. But for us, our product is unique: each seat is a unique product. What that means is we have tens of thousands of fans locked onto our site, all trying to get that particular seat at that single second. It is really, really challenging to handle that kind of traffic. In order to handle it, we need really cool, really great technology. Ticketmaster is a 40-year-old company, so we have been using the latest technology of every single era. So this machine: I don't know if you guys know what that is. It's something called a mid-range minicomputer; this one in particular is a VAX. If you don't know what a VAX is, it's a fancy version of a PDP-11. Maybe I'm talking geek right now. If you don't know what a PDP-11 is, I'll give you some idea: Bill Gates, when he wrote the first BASIC interpreter, was using a PDP-10 at Harvard University. So you can tell how old this thing is. The first piece of Ticketmaster software was written on a PDP-11 and then commercialized onto a VAX in the data center. That was the coolest technology at the time. Then fast forward: in the 80s and 90s, we started implementing Windows clients on top of the VAX, and in the 90s we started adding more and better infrastructure, so we introduced our own custom-made private cloud, using NFS and Xen. At this moment you may think I'm just talking about history, right? But it's not just history; the Ticketmaster stack actually stacks up, one layer on another. That machine is still in use today. From what I heard, at one point we had basically used up the spare-parts supply; we had to go to eBay and buy parts to fix that computer, until eventually we couldn't buy parts anymore and we had to virtualize it. So today, this guy is running inside an emulator on a Linux machine in our data center. We're still using it. So look at our technology landscape. Today we have 21 ticketing systems running 250 products on many different kinds of operating systems: Windows, many different flavors of Linux. We also have thousands of databases, and we have about 22,000 virtual machines in our custom-made private cloud across seven data centers. We're using every single modern programming language: right now there's Java, Node.js, Angular.js; we have really cool stuff. On Windows we're using .NET, C#, Win32, C++. We also have Visual Basic 6, PowerBuilder, Assembly, Pascal; they are all still running inside our data centers today. So what does that mean? It means this: 40 years of tech debt. And all this tech debt is blocking our business from innovating.
We have to fix this problem. The easiest way is to build a new data center. But look at our landscape: we have so many data centers, we have so much hardware; it is really difficult to build a new data center. So our solution right now is that we're done with that: we are using Amazon as our preferred public cloud provider. So now we've got the data center. Then what? Migration. We have to migrate all our applications, all our products, into the new infrastructure. There are many ways to do an AWS migration. The easiest way is to just copy the image from our VM, put it into an Amazon Machine Image, and deploy it into AWS. The problem with that is that we also inherit the 40-year-old tech debt. It won't work. We want to leverage the migration as a carbon filter: to get rid of the old stuff, to get rid of the stuff that doesn't make any sense. We want to improve our tech maturity. We want to make sure every single department can do CI/CD all the way to production without manual interaction. We want to enable developer self-service without ops being in the middle of the way. And we also want to make sure every single technology is up to date and modern. In order to do that, we have to change the way we deploy into AWS. We need to create a safe route so that developers forget about what they did before and go to AWS the correct way. So our solution is: number one, we disable SSH in our AWS environment. No one can shell into a machine and do a deployment; they have to use our CI/CD pipeline. Right now the standard in AWS is that we use the CoreOS AMI as our base image, and every single application must be containerized inside a Docker container. If the application is too big to Dockerize, that means you're too big: you have to break it down into smaller pieces, maybe microservices, or whatever way. If the container has to be rebuilt for every single environment, what does that mean? It means you have to move your application properties somewhere else; the Docker image should be immutable across different environments. Now let's say we have some really, really old software that cannot be migrated into Docker. There is some of that, right? Docker is not a silver bullet. In that case, the business has to think about whether they can rewrite the application, get rid of it, or figure out something else. Docker is our only way to get into AWS. So in this diagram, you may notice a purple box in the top right corner. We have a Windows team, right? I mentioned that we have a Windows stack, so some of our teams are using .NET. As you know, Windows inside Docker is not ready yet. So for those teams, the solution is to leverage .NET Core: they want to move their .NET stack onto Linux. That's one of the solutions. And then, once the application is migrated into Docker on CoreOS on EC2, the next step is moving into Kubernetes. We're using Kubernetes as one of our cost optimization solutions; of course it's more than that. But we don't want to move legacy applications directly into Kubernetes; it won't work. We want to use Docker on CoreOS on EC2 as the gateway, so that we make sure only the good stuff goes into our Kubernetes cluster. So the story doesn't end here. We have the solution for how to get into AWS; we have the solution for how to get rid of our tech debt. Now the question is, how can that actually happen? We have 21 ticketing systems, 250 products.
We have several hundred developers around the world. Asking them to self-service their migration to AWS won't work. We need someone creating safe rails, we need guidelines, and we need to build tools to help the developers. So the Ticketmaster solution is that we formed a new team of six people from different backgrounds: people from software engineering, an engineer from the database side, systems engineers. All six of us came up with something called the migration guide, the migration toolkit for our developers. We define the future state of the architecture, we write the toolkit, we write the tools, we create a lot of documentation. We are the ones helping the developers move into the cloud. This presentation mainly focuses on the tools: how can we build tools that help our developers move into the cloud without increasing their workload? So let's look at the developers. What is their challenge? When they're moving into the cloud, they have multiple learning curves. They have to learn what AWS is. They have to learn the AWS SDK, because they may need to rewrite their application. They have to learn the Ticketmaster way of doing things in AWS. And then they also have to learn all the new tools that the CET team, the cloud enablement team, creates for them. On top of that: business as usual. They have to keep the business running; they have their own business priorities. What does that mean? It means the new things we put on top of them cannot be complicated. They must be frictionless; they must be easy enough that people can just pick them up and use them. So we came up with principles, requirements, when our team was creating the tools; a couple of bullet points. Number one, the new tools must be command-line tools. Yes, web tools are fancy, they're eye candy, but sometimes web tools are also difficult to use. We want to create command-line tools: number one, they're easy to use, and number two, we can also use those command lines in automation. And if we want a UI, a UI version of a command line, that's easy, right? You can just pull the logic out of the command line and wrap it in a web container, and that's it. Number two, the tools must be cross-platform. We have developers using different versions of Linux, developers using Macs, developers using different kinds of Windows. Our tools must be available to all of those developers. Number three, easy to distribute. When we have a new version, we have to make sure everybody gets the new version; it should be easy to upgrade and easy to roll back. And at the end, again: easy to use. So let's look at the low-hanging fruit, or the most challenging question, depending on how you see it: cross-platform. Do we have any developers in this room? A few. So, do you know where this slogan comes from? Java, right? Is that really true? Yeah, it depends, right? I found some of the Oracle marketing pieces on the web. It's so funny, they've got a Kindle right there, and Kindle is on Android. Yeah, they say that Java can run on a toaster, can run on everything. But is that true? Maybe. But before we go into that: can we use Java to do system programming? I was a Java developer for 15 years. Maybe I'm not a good one, but I can say that doing a really simple system command can take a couple of pages of Java code. It's really difficult. And then, next: Java these days is not installed everywhere. Yes, you've got OpenJDK on your Ubuntu, but it's not on the Mac, it's not on your Windows.
And a lot of developers at Ticketmaster are not using Java, and they do not want to install Java on their machines. So Java doesn't work. What about scripting languages? Python, Ruby, Perl: pretty much the same problem, right? They're out of the box on Linux, kind of out of the box on the Mac, but not on Windows. And Windows is even more difficult: think about having to use Python 2 and Python 3 on Windows. How can you do that? It's doable, but it's difficult. And Windows developers don't want to deal with this; they just want to focus on their Windows development, right? So it doesn't make sense to force our developers to install stuff that they don't use every day. And then we thought about this: Go. Anyone written Go here? A few? Yeah. Yeah, if you write Go, talk to me after; we are hiring. So let's look at Go. What is Go? Go is an open source programming language from Google. It's really cool: it has C-like syntax, but the language itself is concurrent by design, and it's really easy to learn. I was a Java developer and I learned Go in a couple of weeks. One of the advantages of Go is that it can cross-compile for different operating systems: on my Mac I can compile a Windows binary, copy it to a Windows machine, and run it over there. Hey, seems like we've got a solution, right? This thing is cross-platform, and our developers don't need to install a runtime on their machines. So that solves the cross-platform problem, but it doesn't solve the distribution problem, right? For the Windows binary we may need to create an MSI, for the Mac a DMG, for Linux maybe an RPM; or you can say, yeah, why don't we just tarball it? If we tarball it, we still have versions, right? 1.1, 1.2, 2.0, 2.1. How can we guarantee our developers are getting the latest stable version? It's difficult, right? We cannot ask them, every morning when they sit down and have their coffee, to download the latest version and install it. It won't work. They don't want to do that; it's too much work. Now we're back to square one. It doesn't work. And then we said, hey, wait a minute: didn't we just solve this problem on the server side? How did we solve the server-side problem? We also have Windows there, we also have Linux, we have every kind of OS. How did we solve it on the server side? We used Docker, right? So let's look at Docker. Docker can run on almost any modern operating system. It can run on the Mac, natively. It can run on Windows; depending on which version of Windows, you may need to use Docker Machine, or you can use it with Hyper-V. And on Linux, it's already there. It solves our cross-platform problem. And then, number two, what about distribution? Our developers just pull the latest version of the image: maybe every day, or maybe they put it into their bash profile, so that whenever they start a shell it happens automatically for them. So they can always get the latest version without needing to know what the latest version is. And let's say we introduce some bug in the latest version: we just roll back internally, and when they say, hey, get the latest version, they get maybe the previous version. Or they can specifically say, hey, I want a particular version, because I know that one works for me. Or when we have some new development, we can have teams say, hey, why don't we try the experimental branch, right?
So by using Docker pull, we make it very easy for the developers to pick up whichever version of the tools they want to use. And a byproduct of this, for the cloud enablement team, is that, as I mentioned before, we come from different backgrounds. I come from the Java world, so Java is my primary language. We have a database engineer whose primary language may be Perl. For a system engineer, the primary language could be Perl or Python, right? So without Docker, it's really difficult to agree on a single language, a single runtime, that we can distribute to the developers. Now we don't care: we can write a piece of code in Java and maybe glue it together with something else, with some Perl, really messy stuff, and on the outside it's a really beautiful Docker container that the developers use every day. They don't need to know what's inside. All they need to know is that the command-line interface never changes. And also, because everything lives inside a container, we can go to the open source world and look for solutions, and then, like Lego blocks, we can connect multiple pieces inside a single container as one single solution, one single command line, for our developers to use. So let me give you an example. At Ticketmaster right now, we are moving into AWS. Instead of using CloudFormation, we decided on HashiCorp Terraform. One of the biggest advantages of Terraform is that it supports multiple cloud providers, so you can use it with AWS, with Google Cloud Platform, or with Azure. But for us, the main argument is that it's open source. Let's say Amazon released a new product yesterday and made it available in the SDK. Funny enough, it's often not available in CloudFormation until a few months later. However, as long as that functionality is in the SDK, people contribute, and the next day we get that function in Terraform. We don't need to wait for Amazon's CloudFormation roadmap. So that's why we chose Terraform as our AWS infrastructure-as-code framework. So let's look at the problem with Terraform. Out of the box, Terraform is stateful. When you use Terraform to deploy your infrastructure into, let's say, AWS, it produces a state file, and that state file is stored on your hard drive. If you lose that state file, you lose track of what you deployed into AWS in that particular deployment. So that state file is really important. And it also means that, since the state file lives on my hard drive, my teammates cannot use it, so we cannot work as a team on that deployment. Luckily, Terraform supports something called a remote backend, and one of the options is writing the state to an S3 bucket. Looks simple, right? But actually it's not. Look at the command line: in order to do that, we have to delete the Terraform cache and then issue this really complicated set of arguments before we can do the deployment, and you have to do that every single time. There's no way we can ask our developers to remember this. You may say, yeah, let's write a shell script, right? What about our Windows folks? I don't want to write PowerShell. I don't know PowerShell. So now we get back to the solution I mentioned, right? How about we bundle this into a Docker container?
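For context, this is roughly what that "really complicated set of arguments" looked like in Terraform versions before 0.9, where remote state was configured with the terraform remote config command; the bucket, key, and region values are placeholders rather than real settings:

    # Clear the local cache so Terraform does not reuse stale state
    rm -rf .terraform

    # Point Terraform at an S3 bucket as the remote state backend (pre-0.9 syntax)
    terraform remote config \
      -backend=s3 \
      -backend-config="bucket=example-terraform-state" \
      -backend-config="key=my-app/terraform.tfstate" \
      -backend-config="region=us-east-1"

    # Only after that can the usual plan and apply run
    terraform plan
    terraform apply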
Our first version used an Ubuntu base. Don't ask me why we picked an Ubuntu base, which is a 500-megabyte image; we just randomly picked one. So hey, an Ubuntu image. We put Terraform inside it, we write some bash scripts, and we package that as something called the Ticketmaster Terraformer. And then we try it out, right? We try it out ourselves, and hey, it's cool. Then we send it to some of our developers, some beta testers, and they love it. So how does that work? Once we have the image, when they want to run it, they still have to issue that command line, the first one, the yellow one, right? Docker run, remove, interactive, and we have to mount the volume, this and that. It's still way too complicated. So our solution is: how about we create an alias in the shell? Now you've got a command called terraformer. By doing this, a lot of our developers even forget about the docker run. Sometimes I get the question, hey, where can I download the latest Terraformer? They think it's a command. They think of it as a binary. So we've completely abstracted that piece away from our developers. Now, I just mentioned that Terraform is really cool and they handle a lot of pull requests. Let's say one day we need to add a new feature to Terraform. They've already accepted our merge request, but it's not released yet. So what we do is cherry-pick our implementation, put it into our internal build of Terraform, package that back into the same Terraformer, and send it out to our developers. Now they have the new functionality. And at one point we said, hey, let's drop Ubuntu and change to Alpine. We rewrote the shell scripts in Go, made it smaller, made it cleaner. Again, the developers don't know about that. They just use it day to day. And then, how about we add some more logic to check that they don't do things that aren't compliant with our policy, right? So we add one more layer to do validation. Again, they don't know. The story keeps going and going. And then at one point, HashiCorp released a new version of Terraform, 0.9, which handles the state file better. It's a lot easier to use. So what happened? We removed all of our own implementation and just use Terraform 0.9. Everything is transparent to our developers, frictionless. Another example, as I mentioned before: since we're moving from on-prem into AWS, we have to change our CI/CD pipeline. In the past we used Jenkins to do our builds and all kinds of magic stuff. GitLab, the community edition, is open source, and we have been using GitLab for many years, just for source code management, to store our source code. A couple of versions ago, they started bundling GitLab CI along with GitLab. Anybody here used Travis CI before? A few. So you know github.com, right? And there's a company called Travis CI that builds a build platform on top of GitHub. GitLab CI works kind of like that. The magical part about GitLab CI is that it supports different kinds of executors. You can run the job in a shell. You can run the job remotely using SSH. You can run the job using Docker. So now that we have the Terraformer Docker image, you can just put it in there, and GitLab CI can execute our Terraformer on the fly. We are using the Docker image exactly like a command line. The other advantage of GitLab CI is that in the past, if you've used Jenkins before, you know everything lives on the master. So if your team uses a different language, you have to install that language on the master, and if you want to use some new plugins, you also have to install those plugins on the master.
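Going back to the alias trick for a second, it is essentially just hiding a docker run invocation; a minimal sketch, assuming the same hypothetical image name and assuming the current directory and AWS credentials are the only mounts needed, looks like this:

    # In each developer's ~/.bashrc or ~/.bash_profile:
    alias terraformer='docker run --rm -it \
      -v "$(pwd)":/workspace \
      -v "$HOME/.aws":/root/.aws:ro \
      example.registry.local/cloud-tools/terraformer:latest'

    # After that it feels like a native binary:
    #   terraformer plan
    #   terraformer apply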
By using GitLab CI, because we're leveraging Docker containers, developers now have the freedom to use whatever language they want to compile with. They have the freedom to use any kind of integration tool for their builds, their testing, or their reports. So we leverage Docker inside our build pipeline so that we can build the Docker images we deploy to the cloud much more easily. So, in summary: the good thing about using Docker on the desktop is that we can easily upgrade the software, frictionlessly. The developers don't need to know which version to upgrade to, and they get the latest version very easily. Secondly, with the Docker container on the desktop, we have shielded that complexity from the developers, so on the back end we can do a lot of really complicated stuff for them to solve their problems. And lastly, the Docker runtime is almost cross-platform, and because we already use Docker for server-side deployment, installing Docker on their machine is already part of the requirement. So we don't need to install an additional runtime on the developer's machine or use up their resources. We have a happy face, and we also have a sad face. Most of the sad face comes from the Windows side. Docker for Windows on Windows 10 still has a lot of bugs. It's getting better now, but there are still a lot of bugs. It's also difficult to set up, because that's the nature of Windows, right? You've got the Command Prompt, you've got the PowerShell prompt, if you're using Git you've got the Git prompt. So many prompts to deal with. And when you use Docker, you have to do a lot of work to link everything together. And we're still waiting for native Windows containers running on Windows Server. That's not there yet. All right, I think I'm going a little bit fast, so right now I'll take some questions. Any questions? The mic's not on. Yeah, am I loud enough? Okay. One question: you're trying to eliminate technical debt. With this new project, what are you doing about testing? I know GitLab CI has some good testing flow, but how do you address testing and, you know, test coverage of the code? Okay, so the question is how we handle testing the code going forward, right? Most of our applications already have unit testing and integration testing, mainly using Cucumber. So most applications already have that piece implemented, maybe not with great coverage, but in the 70% range. So by moving this way, into the cloud, basically if you're doing that already, just keep doing it. But if a team doesn't have anything, we don't allow them to move into the cloud yet. They've got to meet a minimum requirement of tech maturity in order to start getting into the cloud. So they may not be the first ones inside AWS, but hopefully at some point they'll get there. And then, on the GitLab CI side, the cool thing is that you can define different kinds of runners. In our solution, we have one type of runner that does compiling and building containers, and we have another type of runner mainly for testing.
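As a rough illustration of that split, a .gitlab-ci.yml along these lines could route build work and test work to different runners via tags; the stage names, runner tags, image paths, and commands are hypothetical and would depend on the project:

    stages:
      - build
      - test

    build_image:
      stage: build
      tags: [docker-build]          # picked up by the runners that compile and build containers
      script:
        - docker build -t example.registry.local/my-app:latest .
        - docker push example.registry.local/my-app:latest

    integration_test:
      stage: test
      tags: [integration-test]      # picked up by the runners dedicated to testing
      image: example.registry.local/cloud-tools/terraformer:latest
      script:
        - terraform apply           # stand up the whole application in AWS
        - cucumber                  # run the integration suite against that infrastructure
        - terraform destroy -force  # tear everything down when the tests finish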
So what testing means is: since we're in AWS already, the way we recommend is to deploy your whole application into AWS and then use the testing runner, maybe running Cucumber, to test against that infrastructure. When everything is done, just destroy everything. So I think I answered your question. Okay. Curious how you manage the actual images. Is it the cloud enablement team that builds all the base images, or is it dev-specific, each team does their own container and then pushes it? And also, if the registry you're pulling from has access controls, like if anyone can just push to the registry, do you have problems with someone overriding? Because it sounds like you're really abusing the latest tag, so you just keep overriding latest every time. Do you have problems with multiple people trying to push that, or the container being different, when they're sharing the same registry? That's a good question. So right now we're using Amazon ECR as our Docker registry. At this point, we're in the early stage of the migration, so right now we don't have too much governance about, let me take that back. No one can push an image from their own machine. In order to build an image and push it to ECR, they've got to use our GitLab CI runner. And when they go through a GitLab CI runner, we know who did what. We may not have set up a gateway to prevent people from doing that, but we do have a way to go through the logs forensically and trace back who pushed that image. And I'm assuming any team can push to GitLab, which then pushes to the registry, or do you have a centralized team that makes all of the base images? No, everything is self-serve. The developers own the application code, they own the Docker build, they own the whole infrastructure. And then they just distribute that alias to the rest of their team and say, okay, we have this new tool called foo, just put this alias in your bash and then run foo. Yeah, so I get your question now. The CE team, the cloud enablement team, has a set of standard tools, but once a team learns our way of doing things, they may have their own internal tools, so they can copy our model to create their own tools that do something similar. And they have the freedom to push those to the registry and maybe make them available for other teams as well. And do you have problems managing Docker Toolbox on Macs? I mean, I have a lot of issues making sure that Docker Toolbox is updated for everyone and set up the same way and deployed with the same settings. As well, when you do the alias runs, some things need access to your home drive for some dotfile there. So does the alias just always mount those into the runtime? Most of us are already using the native Docker for Mac, so we don't use the Toolbox anymore. And in the native Docker for Mac, there's a GUI, and in there, by default, it already mounts your home drive, but from the GUI you can also mount anything else from your machine into the Docker engine. So it's a little bit easier than the Toolbox. Right, so some of that is by default. And the last question was, how are you actually weaning people off of NFS? Because that typically is a bad thing with containers. Well, the funny part is that at Ticketmaster, everybody has already gotten into some kind of trouble using NFS, so the good thing is everybody hates it.
So now that we have a chance to get away from NFS, everybody's happy. Of course, some applications are really tied to NFS, so the migration is a little bit rough. But we were able to mitigate that by switching to EBS or some other kind of store like DynamoDB or S3. Right, so when you deploy to AWS and Kubernetes, you just mount an EBS volume underneath the container? Yeah, if they're doing Kubernetes, yes. Kubernetes by default has shared volumes. But when I'm talking about, like, Terraform and basic EC2 infrastructure, then they have to change to something like S3 or EBS. Okay, thanks. Any other questions? So my question was just, was it easy to get developer buy-in to switch to the new system? Did you find a lot of resistance? People don't want to change their ways. We're trying to make this change at my company right now, and getting some of the developer buy-in is a little difficult, so what was your experience with that? Yeah, if I say no, I'm lying, right? I would say most of our teams are excited to do the migration. As I said, we're a 40-year-old company. We have a lot of legacy stuff. And a lot of people say, yeah, the business priorities always win over the tech debt. Now we're using this as an excuse to resolve the tech debt, so most people are happy. The unhappy people, usually, it's because their application cannot move and they struggle, right? So we're looking into that and trying to create a solution for them. But I would say 80 to 90% of developers are happy about the migration. Thank you, yeah. How's performance? From what perspective? In a development environment. You mean AWS? You mentioned loading Terraform on your local machine, using a local VM. Oh no, we don't use a local VM, right? So I guess the question is, we're using the Docker container on the desktop, and how's the performance? Okay, so most of the tools are really simple tools. For example, we have our own, so the Terraformer is one, that's Terraform inside a container. We also built tools to do secret management, to do encryption and decryption, and also to do things like application property management. All those tools are really simple code, written in Go, running natively. We're not talking about running a really complex web app inside the container. So a human being won't be able to tell that using the Docker container on the command line is slow. You won't be able to tell. Maybe if you really time it, you can measure something, but for a human being it's fine. Have you branched out beyond command-line tools, for things that require a web interface or actually mount like a session on a... You mean for the tools? Yeah. Yeah, so as I just mentioned, we have a tool to do secret management. We were doing that on the command line before, but it was so complicated to use. So we ended up building another container that uses Go as a web app to launch the command line inside the container. So now our developers only see the web interface, but behind the scenes it's still the same command line doing the encryption and decryption. Anyone else? No? Okay, so we're a little bit early, so thanks for coming. Test. All right, you can hear me? Good to see you here. Good afternoon, everyone. Thanks for coming.
I just found out this room is sponsored by Docker, and it was sort of an amusing pleasure to hear that right before I launched into a talk about our competing container runtime from CoreOS, which we call Rocket. There are necessarily some parts of this talk that Docker might not want in a room they're sponsoring, but I won't be too rough on them. We think of ourselves as friendly competitors, at least most of the time. And I think a lot of what this talk will end up being about, if I do my job correctly, is actually explaining what that means. What is that friendly competition? Why would we do something like write our own container engine? Are we trying to replace Docker all over the world? What are the real underpinnings and motivations for why we built Rocket? What's different about it? What's the same about it? So anyway, I'm laughing quietly to myself when I hear the sponsorship thing. So today we're gonna talk, indeed at a high level, about containers and a little bit of virtualization. We're gonna talk about why you want containers, in the hopes of understanding why they're so important, which will maybe help explain why we did something outlandish like write our own container runtime. How many of you know a little bit about CoreOS? Run CoreOS Container Linux or something? Okay, I will spare you a big pitch on who we are and what we do. We do like to think we run the world's containers. We make a minimal Linux distribution that can be described as basically a kernel with a container runtime above it. Yet another hint of why the container runtime is so important to us and our vision of the scope of everything we're trying to build, of which all of these things are pieces. In pursuit of our commercial platform, which we call Tectonic, we build hundreds of open source projects. I am responsible for documentation at CoreOS, and I occasionally find projects I didn't know we had. So it is understandable, when developers come to the front page of our site or start perusing our org on GitHub, that they might be confused by this breadth of approaches. All of these things are components or pieces that add up to what we're trying to make happen in infrastructure. There's etcd, which is a distributed key-value store. It's very popular for writing distributed apps; it's kind of one of our best-known applications. It solves a difficult, longstanding, and sometimes academic problem in distributed computing, the question of consensus. It does it so well that many large-scale apps you've probably also heard of over the last couple of years adopt etcd as their method of storing configuration data and keeping a clear, consistent picture of it among all of the computers running in a cluster. In addition to that, there's Container Linux, our Linux distribution. There's Flannel, a software-defined networking, yeah, SDN solution. There's a whole wide array of things. But today, we're gonna mainly talk about one of them, touching a little bit on two others. Underneath Rocket, we're gonna talk just a little bit about CoreOS Container Linux and why Rocket fits into that operating system the way it does. And then at a level above Rocket, we're gonna talk about Kubernetes, which is a container cluster orchestration system backed by years of experience running infrastructure inside of Google and now delivered as open source software that we can all build solutions around. So, at a really high level, what is everything we're doing all about? What is Tectonic, the commercial product, about?
What is Google about? What is cluster orchestration about? The pretty basic idea here is that containers allow us to decouple the chain of dependencies between the operating system layer and the applications that we run above it. What does this mean? It means you don't have to run apps on three-year-old certified editions of an operating system that your IT or ops teams approved, no matter what security flaws have grown up in that operating system in the intervening three years. Because you don't have to worry about upgrades to the operating system breaking the application, because the application's dependencies on libraries and configuration, file system layouts, all the things that can change between upgrades, are contained in this discrete object that we call a software container. That fits really well into our view of the world, because probably the thing CoreOS Container Linux is most famous for is that it's a single system image that automatically upgrades itself in a blue-green scheme with an automatic rollback if there's a failure. So you get a server, or what I like to think of as a compute node, because that's really what the system is designed to be, that updates itself the same way your web browser does or the cell phone you carry around in your pocket does. When there are security flaws, fixes are pushed to the Container Linux channels. Usually our customers see resolution of things in 24 to 48 hours, while other people wait weeks for patch-set testing, compatibility testing, and individual packages being upgraded on OSes that are still shipped in a general-purpose, configure-it-yourself kind of way. Once we have all of this, we've got an OS we can upgrade and we have applications we can upgrade. We can actually change them both at their natural cadences instead of having them bound to each other. What this allows us to do is take the result, aggregate it together into large groups of machines, and orchestrate what happens in those groups of machines with software instead of with administrator intervention. The overall idea is that you get a system like Kubernetes that can migrate work when a machine fails, can scale up work when audience demand increases, and scale it back down to ensure the most efficient resource utilization when the audience decreases, without waking somebody up with a page at three o'clock in the morning. At a higher, more inspiring or rhetorical level, we're trying to democratize access to something that McCarthy and others have been writing about since the late 60s that has gone by a lot of names. I like the name utility computing. How do you make this resource, this ability to cluster machines together, available to businesses that aren't just the largest businesses with the most advanced IT departments? So, given that background and what we're trying to do, let's talk a little bit about Rocket. How many of you have ever heard of Docker? I understand they might be sponsoring this room. If the sponsorship's effective, you should all have your hands up, you realize. So, how many of you have built containers, run containers in production with Docker? Right on. So, for the benefit of your time and my time, I'm gonna stay out of telling you all about what a container is. I think we can all kind of agree it's a neat, discrete package for taking apps and dependencies and putting them all in one place. Then, we want to package, verify, distribute, and run those containers. How you build containers is one interesting question.
This talk is actually a little more focused on how you run containers and what kinds of containers you can run on this runtime. Because we faced challenges, both with what we considered potentially architectural flaws in how Docker was laid out, and because we wanted to do some special things that were interesting and motivating for us to do, and that, for very good reasons, are not as interesting and motivating for Docker to do, because they're doing different things and pursuing different solutions. In and around late 2014, we started developing our own container runtime to solve some of these issues. What we wanted was to focus on building a package that could deliver three main objectives for our container runtime, which we would then build all these other layers of software on top of, shipping that software in containers to run with this engine. Rocket, in pursuit of that, focuses on security, modularity, and especially standards and compatibility. What is the container image? How do you build portable, compatible container images and run them with Docker or Rocket or runtimes that none of us have even thought of yet? So, around about December 2014, we shipped a prototype, and folks started using it. What we really wanted to do here, again, was make sure that there was competition and conversation around the idea of a container image standard. Rocket had started the Open Container Initiative, or, I'm sorry, Docker, those two names are very close together even for me. Docker had started, with us, Red Hat, and several other large industry players, the Open Containers Initiative, but OCI at first focused on what the runtime specification would look like and did not have a specification for what container images would look like. For us, that looked like a future of fragmentation. If we don't know how to build a standard container that we can run across multiple runtimes, and if multiple runtimes arise to solve different specific problems, to operate on different platforms or CPU architectures, we'll have fragmentation between the images they consume, and that will make it difficult for an image to be truly portable. By February of last year, we shipped version 1.0 of Rocket. It already has several production users. One that we like to talk about the most is a company called BlaBlaCar from Europe. They're a ride-sharing service, sort of a long-distance ride-sharing service. I assure you the name of the company sounds much cooler when the French founders say it than when I say it, but their name is BlaBlaCar, and they use Rocket internally because it allowed them to sign the images they were moving through their integration and deployment chains. That was the number one motivation for their usage of it. By the middle of last year, we saw packaging of Rocket for certain Linux distributions. Debian, Fedora, I'm not 100% sure if we've moved out of Rawhide in Fedora, but you can definitely get a Rocket package, I think maybe in Fedora 25 the package joined the standard repos. And Arch and NixOS and a few other interesting ones, if you're into that kind of thing. So how does Rocket focus on security, modularity, and driving standards forward? Well, first of all, one of the things that we found and had trouble with in deploying containers in production onto the Docker runtime is that Docker is definitely good at delivering a really well-rounded developer experience on developer laptops.
They're really good at making it easy to get spun up and using containers, making it easy to build containers. But in the way they've done that, we feel like there's a lot of stuff, especially if you go back a few months and talk about the original monolithic Docker daemon, a whole bunch of stuff in there that kind of reinvents the wheel. And we didn't wanna do that in Rocket. So one of the things we had consistently struggled with, or had to find clever solutions we didn't really wanna bother with, for Docker containers in production was this idea of process management and init systems within the containers. Yelp, very famously, about a year ago, wrote a system called dumb-init and promoted it in some blog posts. And what dumb-init is, is an init to run inside their containers to manage processes, reap zombie processes that have lost their connection to PID 1 within the container, and deal with all of the things that init systems have done on Unix for 40 years, that systemd has done in a pretty approachable, understandable way on a wide variety of distributions for at least five years. And we wondered, why would you wanna rewrite that stuff? Why would you wanna reinvent the init system? Literally the oldest user-space process in the Unix ecosystem is init. And it's 2016 and we and others are writing inits. It seemed kinda ridiculous to us. So what we looked at was how we can leverage things that already exist. The best illustration of this is how Rocket uses systemd, and when we get to the little demonstration that I'll do in the middle of this, it'll really draw this out and make it clear. But suffice to say at this point, what Rocket does not do is force you to write your own init system. In fact, when running a container under the default isolation system in Rocket, which uses cgroups and namespaces, all the things we know and love to isolate this container in execution, there is a systemd, what you might think of even as a stub systemd, running inside that container, inside that pod, when Rocket instantiates it. So that systemd can actually manage child processes. When processes die, you don't have zombie processes inside the containers. And you can do sophisticated lifecycle management, which systemd is very capable of, within the container, without learning a new system for doing sophisticated lifecycle management of processes. A few other ways in which we do this: from day one, rather than trying to write our own, because we ship an operating system, a Linux OS with an extremely up-to-date kernel, we've had the good fortune to pretty much be able, at least on CoreOS deployments, to depend on overlayfs and some newer technologies that are a little harder to roll out in the field if you're looking at Red Hat or Debian distributions that trail the mainline or Torvalds kernel tree by a little bit. Moreover, when we did have to invent things, we tried to do it in a way that built interfaces rather than specific implementations of those things. So I use, as the third example of that approach, our work on CNI, the Container Network Interface. When we began to try connecting interesting networking to Rocket containers, rather than just run-of-the-mill TCP/IP, when we tried to get into enterprise schemes like IP address management, which might have a dedicated program that typically provides it, and the many, many SDN systems that are in the world.
We wanted Rocket to be able to work with these, but we didn't want Rocket to need a bunch of internal knowledge about those things, or to carry around implementation details of those networking systems in the container runtime. So instead of building modules for each of them, what we did first was build an interface system, what I think of a lot of the time as almost a VFS, a virtual file system, for container networking. CNI is very, very simple. It says you need two things to be a CNI plugin: you need a network, and you need a container you want to connect it to. What is that network? How do you connect it? What configuration does that container receive? Those are questions that are actually handled by the plugins for CNI rather than built into CNI itself. So plugins can look very, very simple. They can be bash scripts that do some kind of simple DHCP allocation for, let's call it a mock or a test of IP address management, IPAM. And then you can scale up from there to what you're actually doing, because you have an interface and something modular to plug these plugins into. Secondly, Rocket is focused on security. What are the ways we've tried to pursue that? Well, when we created our own original image format for Rocket, in the days before workable shared industry standards for container image formats, one of the very first things we did was design it in a way that we could have signed images, but moreover, that they could be cryptographically verifiable offline. One of the things we see in production is that the image signing system in Docker depends, in brief, on the registry, on the system serving you these images. And I touched on this a little bit with one of our users, BlaBlaCar: they have a policy, and enforce it, that no unsigned images, no invalidly signed images, can ever, ever be deployed to their production systems. This is neat and works well with Docker so long as the place that's validating those signatures for you stays online and your network connection to it stays up. But it doesn't work very well if something happens at Docker Hub, or on the on-premises Docker registry that you're running, and you can't do that signature verification. Well, there's a well-known method for dealing with this: GPG offline detached signatures. They're how packages on most Linux distributions are signed. So we talked about and built into ACI, Rocket's native image format, this idea of using detached GPG signatures to verify container images. It's really simple, but it does mean that air-gapped systems can verify container images. It means that systems can verify container images during network faults and during registry failures. One of the other things that I'm only gonna touch on a little, because I don't want to get too in the weeds with it, that's kind of neat and that we built on top of this idea, is something we call distributed trusted computing. That actually integrates some of these features with the TPM that's shipped on almost all modern Intel motherboards. The TPM is the Trusted Platform Module. It sits outside of the operating system, outside of the system CPU. And we use the TPM in distributed trusted computing deployments to validate signatures on images and to maintain an essentially untamperable log of container execution events, signed with TPM features. There's more info about that on the site, and definitely in the Rocket documentation, if you're really interested in it; I don't wanna go too far into it for this talk.
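To give a feel for what those offline, detached signatures look like in practice, here is a hedged sketch using plain GPG plus Rocket's trust and fetch commands; the image name, key file, and prefix are made up, and exact flags may vary slightly between rkt versions:

    # Publisher side: produce a detached, ASCII-armored signature next to the image
    gpg --armor --output myapp-1.0.aci.asc --detach-sign myapp-1.0.aci

    # Consumer side: trust the publisher's public key for a given image prefix...
    rkt trust --prefix=example.com/myapp pubkey.asc

    # ...then verification happens locally against that key, whether the image
    # comes over the network or from a file already sitting on disk
    rkt fetch example.com/myapp:1.0
    rkt fetch ./myapp-1.0.aci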
Last but not least, and I think perhaps most importantly, because this is how Rocket actually drives the idea of container standards forward: modularity. And when we talk about modularity and Rocket, we really mean two axes, two dimensions, of modularity. First of all, Rocket is externally modular. Because it leverages systemd, because it's built on top of modern Linux kernel features, because its systems for communicating with complex networking and with orchestration systems are built around interfaces like CNI and the CRI, Rocket is externally modular. It fits in well, it plays well with others. It's easier to integrate with existing systemd management and boot schemes than if you have an init that you're managing yourself inside all of your containers. It's easier to start Rocket containers with systemd units, because we don't have to do the little preparatory dance with the Docker daemon to pre-kill, pre-load, pre-fetch, and then finally run images, because Rocket is actually leveraging systemd and systemd-nspawn to create the isolation environment that the container is going to run in. So that's external modularity, as I like to think of it. Moreover, Rocket is internally modular. Now, what we've talked about to this point, and what we'll mainly talk about in the little demonstration where I'll show you some things on the command line, is what we could call standard software container isolation. It's built on cgroups, it's built on namespaces, there's a shared kernel, there are multiple apps running in isolated spaces atop that shared kernel. That's not the only way Rocket can execute container images. Rocket actually has what we call a staged system for executing these images. From the outside it looks a little bit like bootloader chaining of kernel modules at system boot time. And what it means and what it does is let Rocket execute a container in more than one way. There are three major ones that I'll talk about to illustrate this idea of modular execution within Rocket. There is the standard one, which we would call our default container isolation. There is a much less isolating mechanism for running things that are bootstrap programs, or that need to configure networks or host file system namespaces. We call that Rocket Fly. That basically boils the container down to a really secure and verifiable packaging mechanism, but then runs it at near-host privileges inside a chroot, with the container contents on the host, for when you need administrator, root-level access to manipulate things on that host. We use that to bootstrap some of the software we run on top of CoreOS Linux. Last but not least, work between ourselves and Intel in the Clear Containers project built another stage1 for Rocket that actually uses the KVM hypervisor and the VT-x extensions on Intel hardware to take, again, this same container image and isolate it the way virtual machines are isolated, with its own kernel, without a shared kernel, running under hardware VM protection. So those are three ways; we'll look at them all in a little tiny bit of detail after I mention something else really important. Rocket runs Docker images. What this means, and this actually tells you a lot about our point of view with Rocket and what we're trying to do with it: Docker has really great tools and an ecosystem around building images. It would be challenging to try to replace that.
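As a small sketch of those last two points, and with the caveat that flag names shifted a bit across rkt versions, running a Docker-built image and choosing a different stage1 looked roughly like this:

    # Run an image straight out of a Docker registry (Docker images are not signed
    # the way ACIs are, so image signature checking has to be explicitly relaxed)
    rkt run --insecure-options=image docker://nginx

    # The same image under the KVM stage1, so it gets its own kernel
    rkt run --insecure-options=image --stage1-name=coreos.com/rkt/stage1-kvm docker://nginx

    # And the fly stage1, for bootstrap tools that need near-host privileges
    rkt run --insecure-options=image --stage1-name=coreos.com/rkt/stage1-fly docker://nginx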
Nevertheless, for production deployment of those images we have certain demands and requirements. Because Rocket can run Docker images, we can build and maintain CI chains and all of our old processes that yielded a Docker container in the end, but just push it and run it in production with the Rocket runtime instead of with the Docker runtime. So it's a key point, and it helps drive the conversation around the container image specification that has now been generated at the OCI consortium, which we are discussing and working on with Rocket, and which is actually very close to a 1.0 release. So let's take a look at the stage1s, these different ways of executing the same container image with Rocket. I've mentioned a lot of this already in passing, because this is the default Rocket, if you will. But the default stage1 looks a lot like Docker in terms of isolation when you see your containers running. We use cgroups, we use namespaces, and we use systemd-nspawn, driving it to set up those environments and set the container executing in them. All container applications, when executing, execute inside a machine PID space. So from systemd's point of view, they look like a known entity with a manipulable handle, that is, the handle from systemd to the machine PID space, and from there to the systemd running within the container in that PID space. So there's a chain of management, of process management, that actually extends all the way through a container running under Rocket's default stage1 isolation. This is neat because it lets administrators use systemd to manage Rocket containers when they're running single nodes, as a way to deploy services in small environments, instead of learning a different init system or a different way of starting and managing applications. Second one: KVM isolation. Again, I want to mention that Intel did a lot of work on this, and the Clear Containers project contains a lot of the fruits of these explorations. They actually use KVM to take your same container image and run it above a hypervisor as if it were a virtual machine. Why is this cool? Some of the things we do with this are migrating legacy virtual machine images into cluster-orchestrated environments. What that means is, if I can turn a VM into a container and run it with Rocket as a VM, because Rocket is managed by systemd, even when Kubernetes is running Rocket, it means that I can actually orchestrate VM images in the exact same way I orchestrate regular container images. So that's kind of what this work is about and what the aim of it was. Some of the neat things we did, and continue to do, with KVM isolation are actually running OpenStack on top of Kubernetes and using Kubernetes to manage the OpenStack lifecycle. If you've ever run OpenStack, you know that there is some lifecycle management involved. You have a bunch of Python apps. They do occasionally fail. It's really, really neat if you can orchestrate them and start them again in an automated way. Then third, I talked a little bit about Rocket Fly. Now, Rocket Fly basically takes a container and says, I'm not all that interested in isolating this thing. I need this process to be able to do low-level things on the host where I'm gonna run it. But I am really interested in being able to validate a signature, pull that container from the same place I get all of my other containers, and stick with all of my build-time, build-chain, CI-chain, or deployment strategies and policies.
And those are built around containers. So for these special applications that have low-level work to do on a host, Rocket Fly lets you distribute them in a signed container and run them on the host with the privileges they need to access low-level host features. Perhaps most importantly, what the stage1 is all about is the ability for the open source community and our partners in the ecosystem to create stage1 environments that match their particular demands. The KVM stuff is a good example of this. We didn't invent that at CoreOS, but it is useful to us for things we do, and it definitely met aims that Intel had for the product. By having this modular system internally, we feel like Rocket's a platform for developing the things you need to do with containers in your own special situations or in your own architectures. So let's take a look at running a container or two with Rocket. This will not be a super fancy demo, but I do think it helps show what's going on when we run something with Rocket. So a few of the things we do with Rocket look pretty standard. There's a list command to show me what is or isn't running. There's an image list command to show me what container images I've already put into the local content-addressable store. I have several, because I know better than to trust Wi-Fi when I'm doing these, so I have already fetched the images, but we will run them live. So we have a run command that's actually very, very simple. Obviously, if you wanna connect these things to outside networking, there are configuration options for the Rocket run command, but I'm not really gonna look at that today, because what we wanna talk about is the isolation of the default stage1, what that looks like, and how Rocket works with systemd. To do that, I'm gonna run a container of my own build that we store at Quay.io, our container registry that has some neat features and security scanning. This container image is the Caddy web server. Anybody ever heard of Caddy? It's a neat little web server written by Matt Holt. It's in Go. That makes it really cool for container demos because it ships as a statically linked executable. There's only one thing in this container I'm about to show you. There is no base "FROM" operating system image. There's no package manager inside this container. There's only a Caddy binary and a default HTML file to serve. So when we set Caddy running, I've actually just gone ahead and done it interactively here, so we'll see logs and feedback, which will be about all there is. Caddy's told us that it started up successfully. Let's take a look, on the Rocket side, at what that looks like. First of all, boringly enough, now we have something in the Rocket list command, because we are running a container. That command tells us about the default networks that container has been connected to, as you can see in the IPv4 setting. It tells us what image and what version of that image we're based on. This all looks a lot like docker ps, if you're familiar with docker ps, and it's doing the same job. Here are some things you don't see when docker ps is your only source of monitoring status. Let's take a look at just a generic systemctl status on the machine where I just started the server. Usually I walk around and point, but they've asked me not to for this recording, so I'm gonna highlight this. Here you can see, when I did the Rocket run, we use systemd-nspawn to create a machine slice for everything we're gonna run in this container, in this pod.
Inside of that machine slice, if you go three lines down, you'll see that it itself has a system slice with a service running in it called caddy.service. That service is managed by a systemd running in this container, along with whatever else you run inside the container. So to systemd, this is sort of what the processes look like and how they can be managed. The thing that I should probably draw out here is what that means: I can do a systemctl restart on that Caddy service. I can manage the entire machine block that any container's running in with systemd's virtual machine management functionality, the machinectl suite, right? So these are all handles for skilled administrators to manage container apps without learning a whole new set of container tools. Moreover, they're API handles for interacting programs to manage containerized applications without mastering an entire new system of tools and philosophy and architecture and way of thinking. So let's look beyond systemd a little bit and just take a look at this running directly on the host. And this is actually where we can visibly see the systemd stub, as I call it. I mean, it's actually just systemd. Stub isn't really a good term for it, but it sort of helps people understand what's going on here. This is a systemd responsible for being an init within this container. So, long and short, that's what a process running on a host under Rocket looks like. As you can see, the differences between Rocket and Docker are fairly minimal in terms of the Rocket commands for managing containers, but the fact that you gain systemd and init handles into all of these containerized processes is sort of the big win of how the default stage1 does container isolation by leveraging systemd. So actually, before I link Rocket to the next level up, cluster orchestration, do you have any questions? Go for it. Okay, I'm gonna repeat your question for the video and so everybody can hear it. The gentleman asked: he sees in our process listing here that I have used sudo to instantiate the Rocket run command. It's a really good question, because we do actually talk a lot about how much of Rocket's operation can be done unprivileged, and about how we've tried to think about privilege separation in the implementation of the subcommands. There are a couple of subcommands that manipulate the environment and require root. Run, obviously; fly and the different variants of run are gonna require it; and then the image store management commands, like gc for garbage collection, you also have to have root to run those subcommands. You'll notice, though, that I can do Rocket list, Rocket fetch of images, I can't type though, but I can do all these other things. Anyway, you can notice that I can do a fetch, I can do Rocket list, I can do most of the Rocket commands without needing root privileges. So it's actually a good question, but there's not really a way for us to do the systemd-nspawn part without having root during that transition. So what we've done in terms of privilege separation is try to say, none of these subcommands ever needs root unless it really, really needs root. So you don't need to be root to list images, list the image store, or do basic management, or at least monitoring, tasks. Anybody else?
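For anyone following along without the screen, the demo boils down to a handful of commands roughly like the following; the Quay image path is approximate and the output is omitted:

    # What is in the local image store, and what is running right now?
    rkt image list
    rkt list

    # Start the Caddy container interactively (run itself needs root)
    sudo rkt run quay.io/example/caddy

    # On the host, the pod shows up as a machine slice with its own systemd inside,
    # so ordinary systemd tooling can inspect and manage it
    systemctl status
    machinectl list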
All right, so one of the reasons we care so much about this is because what we're really doing, and sort of the underpinning of our commercial product, is this idea of cluster orchestration. Our bet on cluster orchestration, what it looks like and what we think is gonna work as this sorts itself out in the industry, is Kubernetes. Kubernetes is a cluster orchestrator based on the Borg and Omega projects at Google. Essentially, Kubernetes is Google's knowledge of how to run massive cluster infrastructure and containerized applications at scale, delivered as open source. We had originally written an orchestration system of our own at CoreOS. It was called Fleet. We actually just officially deprecated it fairly recently, but our work on it has certainly slowed down over the past year, because, as Alex Polvi, founder of CoreOS, likes to say, Google sort of completed our sentence. We had done Fleet. Fleet had about half the features we really would have liked it to have over its lifespan. Kubernetes came out, and it had all of those features, and an API, and 30,000 engineer-years of Google expertise in it. We have a lot of engineering expertise, but we maybe don't have access to that. We have been helping people run applications at these kinds of scales, almost all of us in the company, for almost all of our careers, but Google has been running applications at this scale on this kind of architecture. We've made a heavy bet on Kubernetes, and it's actually at the core of our commercial offering, and in fact it organizes, in principle, in an architectural style, or aesthetically, you might say, our open source work as well. What is Kubernetes? It's a system for making a lot of computers work like a single unified compute resource. Beyond that, it looks at automating management tasks: when a machine fails, where does its work migrate? When a container errors out, how does it get replaced? When a bunch of new people come to consume a service, how do you scale it up? Automatically, that's essentially what Kubernetes does. It does so through an architecture that looks a little bit like this. Up at the top, you've got a control plane that keeps track of everything and assigns work. It talks through an API to all the individual compute nodes. Each of those compute nodes runs this thing called a kubelet. The kubelet has one job. It really has a lot more than one job, but for purposes of simplification, its job is to know how to run work on that compute node. In the default Kubernetes installation, and especially in the early days of Kubernetes, the way you ran work on each of these compute nodes was in a container under Docker. We have done a lot of work over the last six months to make it not only optionally available for you to run containers on your compute nodes with Rocket, but actually, more importantly, to make it possible for you to run containers under almost any container execution system on your compute nodes. Because instead of just going into Kubernetes and trying to hack Rocket in there so that Kubernetes could talk to Rocket, what we first did, and I touched on this a little bit before in the work that we formerly called rktnetes, was refine and simplify the interface between the orchestrator and the container runtime on each individual compute node. And working with the CNCF folks, Google folks, Kubernetes, the wider open source community, and Docker themselves, we actually boiled that down to a system we call the CRI, the Container Runtime Interface.
This is a standard by which an orchestrator can communicate with container runtimes on compute nodes. What it means is that Kubernetes doesn't have to know a whole bunch about Rocket in order to work with Rocket, and vice versa. It accomplishes much the same thing as CNI does for networking plugins, but in the space of the runtime. That's really been the driving principle of the work in Kubernetes to make Rocket available. What we wanna make sure is that at each point, we have a standard that is accessible for an ecosystem and an industry to build solutions with, around, and on top of, in each of these key spaces. We've talked about it for container images, and why we did work first with ACI to try to spur those ideas, then with OCI to actually coordinate them, build consensus around them, and deliver them in a usable way. CRI is the same mission in a different place. Make this interface between the orchestrator and the runtime a well-understood one, a simple one, and a modular one, and people will write runtimes that solve problems we maybe don't even know we have yet. So currently, this is kind of what that looks like. Each kubelet uses CRI to communicate with its container runtime. That looks one way for Docker, but for Rocket, it looks a little bit like this. The CRI specifies how the kubelet can communicate with systemd and Rocket's own read-only API service to monitor and gather information about the containers running on any given compute node, and to instantiate new containers on those nodes through the default cycle of systemd-nspawn that I showed in the demo a little bit earlier. So one of the advantages of the external modularity that caused us to make choices about working with systemd instead of reinventing those wheels is actually easing the connection to other systems, like orchestrators or other operating systems, because again, we're trying to present either clearly defined, well-understood modular interfaces for new things, or, even better, we're trying to just use systemd, something that lots of people already know how to program against or operate. So, I got a little ahead of my slides here. I've already just told you these things. This is why we did this. The bottom one is really the key thing. The mission behind Rocket itself, behind ACI originally, behind the work from ACI continuing on in the OCI image spec so that Docker and we and others can actually share it, is really about driving standards and interfaces, because we believe this infrastructure is as important as something like the personal computer or the internet. And we would be in a terrible mess if TCP/IP were the commercial province of a single vendor. It cannot be a standard unless it has more than one user, and if nothing else, what Rocket does is allow the creation of standards, because it presents a second user, and especially a year ago, presented a second user where there was no other consumer of that interface. If Docker were going to be the only runtime, there would really be no reason for Kubernetes to have a neat, modular, clean, well-thought-out interface to the container runtime. You've got to have the competing implementation to sort of boil quality into the standard, is how we look at it. There are some other reasons we want it for ourselves, and that is that Rocket does some things that Docker doesn't. I talked about them a little bit. We do some things where we want to run pods as though they were VMs, or vice versa.
We want to run VMs, legacy VMs, for certain customers, and schedule them and orchestrate them on large groups of machines as if they were just another container. Now, obviously, saying that in a sentence is a little bit easier than actually doing it, but Rocket gives us the primitives to do it. For anybody who's worked with containers a bunch: does anybody want to guess what the hard part is about VMs, or containers actually, when migrating them around in orchestration? Compatibility is tough. Like, if you have five different container images, you're not going to be able to run them on different platforms or with different engines. The thing I was thinking of, and the thing that has always got you in migrating applications around, is persistent storage. And because I don't have a clear, absolute answer to what I think the industry picture is going to be, I'm not going to talk about persistent storage in this talk today. Needless to say, it's really cool: if you have a statically compiled app that doesn't have links to libraries on the operating system, it becomes really, really easy to migrate that around a bunch of compute nodes when they fail, unless it happens to be a database that needs to store a lot of people's credit card numbers and not just throw them away every time a new container is instantiated. So there are several looks at how this might work out. There's the volumes idea in Kubernetes. There are folks who say, you know, handle it all as NAS and just back it with Ceph somewhere outside of the cluster, like you did before. I'm not sure any of those answers is a 100% answer yet. But anyway, the challenging thing about this idea of migrating VMs around continues to be their persistent storage. So to kind of go back to the top, here's what I hope I've shown that we're trying to get out of Rocket, and maybe enlightened you a little bit about. Back to basics. Containers are good because they let us decouple applications from the OS. They let us update those two different entities at their natural cadences without artificially tying them to one another. Doing that gives us better security on the operating system side, because we don't have OSes that were certified three years ago that the ops team isn't willing to upgrade because different departments may have dependencies on different packages. On the application side, we have better security because we can sign and validate a known, discrete image with which we can distribute those applications. And best of all, we can orchestrate the result as one unified resource by leveraging orchestration tools like Kubernetes, because the operating system can update itself, and Kubernetes can manage the cycle of those updates and migrate work when that needs to happen, without requiring downtime for the end consumers of that service. These are all of the advantages that we're trying to drive out of this stack, or this platform, as we see it. And again, and especially in a community kind of way, and something I think about a lot because I tend to talk at community events like SCALE as opposed to AWS re:Invent or more commercially oriented shows and demos and stuff, this idea of democratizing access to this is actually very exciting to all of us at CoreOS. A lot of us have spent fairly long careers in either operations or programming, building distributed systems like this. Like, I did work on the Plan 9 operating system as a major fan and an occasional open source contributor to it.
This feels in a lot of ways like the industrial phase of these solutions: the kind of architecture I've been fascinated with for so long is catching on industrially. There's now maybe a point where we can really deliver this to people who will use it, instead of only really, really large technology organizations or NASA JPL, which is sort of the roots of where this technology comes from. It's exciting that we can deliver utility computing in a way that folks can actually use in their businesses. A few highlights before I wind up and ask again whether any folks have questions. These are things I think of as markers or milestones that maybe this effort and this pursuit have been worthwhile. As I said in the beginning, there was originally a friendly competition between Rocket and Docker. There are now other efforts like OCID at Red Hat; that's an even friendlier competition, and we have an excellent relationship with Red Hat. What we hope we've done here is made a lot of those things possible, or made them happen sooner than they might have otherwise, by pushing these efforts. In Kubernetes, for instance, you can see that the CRI is adopted and is the definition of the interface between Kubernetes itself and the runtimes it actually uses to do work on compute nodes. That's a real thing: CRI is how you do that. CNI is the Kubernetes network plugin model. So this idea of the container network interface is in fact becoming a standard not just for how you connect containers in Rocket to their various networks, but for how you connect any container running in a Kubernetes orchestration system to various networks, and, if you're a developer, for how you write network plugins that can work with those systems. Last but not least, I would mention a lot of the work that you've seen in Docker over the last year. I'm not going to claim that we got them to do it; they're very good programmers, and I think they were aware of a need for certain architectural refactoring. But I think Rocket really illustrated some of the ways that refactoring might happen. If you look at a really modern Docker, say since 1.12 or 1.13, the very latest releases, when you do a ps now you see a set of system processes that looks considerably more like Rocket, because the things that were formerly tied into a single monolithic binary, which was both daemon and client, have been broken out into the Docker client, containerd, and runc, which actually does the container running and is the reference implementation of the runtime spec. So I'd like to think we helped explore that space with them and helped prompt some of those architectural improvements in Docker as well. And I think all of these things, including OCID (now CRI-O) and some of the other efforts we've seen from others building their own container engines, are really helped by the fact that we have this shared idea of what the interface looks like. Now, if you write a new container engine and it adheres to CRI, Kubernetes will eventually be able to use it as an execution engine on individual compute nodes. That's a good value proposition if you're looking to build a container engine: to know that you can plug into the ecosystem around scaling this stuff up beyond a single machine. Last but not least. How many times have I said last but not least? Can you have five lasts? Is it not least, yeah? Are you counting?
I figured you were counting. So, once again but not least: all of these things assemble into our commercial product, which is called Tectonic. Tectonic takes pure upstream Kubernetes and bundles it with a lot of the things you need to actually run it in an enterprise: how do I connect this fancy new cluster to my LDAP system or my Active Directory, where we store all our users? How do I account costs in this cluster to user groups or departments, the way we're used to in our large Fortune 500 corporate policy? Tectonic adds some of those pieces around the parts of Kubernetes that are exciting to us as developers, the actual tech part, so that it's something you could eventually move through the door of a real business and get your managers to buy into. Tectonic has almost all of the pieces of open source software that we work on either in it or contributing, philosophically or aesthetically, to how we think things ought to be done there. Again, etcd is the key-value store in Kubernetes itself. Kubernetes can optionally use either Rocket or Docker as the runtime, and obviously in a Tectonic deployment we put CoreOS Container Linux as the foundation, the automatically updating operating system piece that makes all of this go, makes it hang together, and actually drives some advantage out of the idea that you have a whole system to manage a big group of computers, instead of just a big group of computers managed by some sysadmins. These are some links. These slides will be available both through the conference and at my Speaker Deck site; a link to that Speaker Deck is also in these links, so you don't have to memorize these for the quiz afterwards. At speakerdeck.com/joshix you'll find almost all of my talks, including this one; I'll have it up by this evening after we get out of here. And I'd really love it if you'd all join us in San Francisco for our own open source conference. We will have speakers from throughout the open source community that surrounds Kubernetes and CoreOS. I'm really hoping for Josh Berkus from Red Hat; he keeps promising me he's going to send a submission, ever since we were in Brussels at the beginning of this month together. So check it out. And in fact, if you're interested in speaking, please send us a CFP. You will have to check the site to tell me when the CFP close date is; I'm not sure, but it's very soon, so do it soon. Does anybody have any questions? Okay. Let me first repeat the question. As I repeat it, you can tell me if I misheard the part in the middle, because I had a little bit of trouble hearing you. The question was: I touched on challenges in container cluster environments, and it all sounds really great, you just move your apps around whenever they fail or you need to scale up, no problem, but what about persistent storage? How do you bring it with you? Where does it go? So the question is, could I speak a little bit to the challenges of networking in these kinds of environments, right? On a daily basis, I come out and give a demo of: here's a container. Here's a container connected to a network. Here's a web browser consuming that web server.
Here's that container failing and being automatically spun up, either by systemd or, if we're in a cluster, by Kubernetes, and notice that I'm still connected to the service. At that level, your basic TCP/IP connection, things just work. You mentioned host names, but the truth is that containers shouldn't internalize host names. If you want to get around that problem, I would suggest doing it with an environment variable or something, so that it happens at runtime instead of being baked into your container, because on a cluster you're almost never going to get a repeatable host name. You know what, I think you may have just accidentally told me exactly what you're really asking about, because here's where this is hard. Interestingly, to me it may again be a persistent storage problem, because what containers do I have that have their host names baked into them? Web servers that have TLS assets. The short-form answer is that the Ingress object, or Ingress notion, in Kubernetes is fairly new. It's been in the Kubernetes beta channel for a while; it may have moved out of that in 1.5, I haven't checked. We're on 1.5.3, and I don't know if it came out in that series. But Ingress, which I am not an expert on yet because it's fairly new, is the cluster-level solution to this idea of what the actual endpoint of an external service is, and how you deal with it being remapped over and over again when it has a bound name for external reasons. And the most common external reason is TLS assets, right? So that would be what I'd suggest you look into. Or maybe afterwards we could look at your specific container and I'll have an idea; it may just be that I don't know for that particular question. The last thing I would say about it is that between Ingress at the top, which is sort of the outside edge of the shell, how you expose services and map them in, and the maturity of CNI for how you actually connect containers to the pod network in Kubernetes, all of that stuff works pretty well and is not a major pain point for us. I think I get enough about your specific question to understand it; I just maybe don't have a here's-exactly-what-you-do answer. Thank you. Go for it. Getting started with Rocket, along with Kubernetes? Okay, along and with. Well, getting started with Rocket is really easy. Probably the very easiest way to do it is to spin up a CoreOS AMI on AWS. Then you'll have Rocket already installed, and you can say Rocket run with whatever your favorite Docker container is, and it'll run it. And it will actually seem boring to you, it's so easy. It will run the Docker container under Rocket, because Rocket knows how to talk to Docker container registries, how to fetch Docker images, and how to run Docker images; it's compatible with Docker images. Moreover, and I really should underline this, the OCI specification that is a joint effort between the folks we've talked about is based on the Docker v2.2 image format, with information and experience that we learned from ACI built into that same spec. So in the future, hopefully this will be even less of a question, because "how do I build a container for any container runtime" will have a standard answer, rather than here's how you build one for Docker and here's how you build one for Rocket.
If I were brand new to Rocket, I would continue building images with Docker right now, today. Or I would look into the OCI image spec tooling for building those images. The ACI effort was always an experimental image format that we used within Rocket. Unless I were trying to learn about image formats or something, I probably wouldn't start building ACI images today. ACI's purpose was largely to spur the OCI spec, and having done so, most of the thinking and work that would have gone into ACI is now going into the OCI spec. So the easiest way to immediately run anything with Rocket is to take whatever Docker image you like and run every day, and instead of saying docker run tomorrow, say rocket run. Two, there is a table: if you go to coreos.com/rkt, there's a table that lists out some option differences, syntactical differences. For example, in Docker you have this idea of -p 80:80, meaning we've mapped external port 80 to internal port 80. In Rocket there isn't a short form; it's --port, and you specify the port along with its protocol instead of just a bare number. It accomplishes the same thing, but there's a little bit of syntactical difference. There's a guide, a side-by-side table: if you do this in Docker, here's how you do it in Rocket. To get started with Rocket in Kubernetes, you want to look at the last link on my slides here. You can also go to the Kubernetes docs, and under the sysadmin section there is a Rocket section. But if you go to this blog post, it'll take you immediately there and explain a little bit about what's going on with the rktnetes effort. That's the best way to get started with Rocket in Kubernetes. We'll move right on, thank you. Go for it. Moving to CoreOS and Rocket? Well, if I were advising you in a pro-services way, I would probably say you could do this in two stages. The first thing is, what you're after is automatic updates and all the neat CoreOS Container Linux stuff. You could migrate to that OS and not move away from Docker at all. We ship Docker; we only really ship two major pieces of container software in Container Linux, Rocket and Docker, and they're both there. So if you wanted to do absolutely no work on your migration, you could switch to Container Linux, get automatic updates and all of the Container Linux stuff, and just keep running your Docker images under Docker. Secondly, and this touches a little on the answer I gave this gentleman: because Rocket can run Docker images, you could then begin, one by one with your important container images, looking at that table of differences in how you pass configuration. They're just syntactical differences. And simply experiment with them: okay, I do a docker run and it has two ports and two volumes attached to it, so how do I express that to Rocket? The docs will tell you. Then run it under Rocket, test for regressions, and make sure everything works as expected. That would be a one-by-one, container-at-a-time way to move. Because the thing you will have to adjust for is that we're compatible with the container image, but we express the configuration in different ways than Docker does. That's what you have to figure out, more than how do I rebuild this image, or do I put it through a conversion program, or anything like that. Sound good? Thank you for the question. Anyone else? No? Well, with that, I hope it was useful. Thank you very much for coming.
Thank you, SCALE, for having us. It was very cool. Was it any good? Did you like it? All right. Testing, testing, one, two, three. All right. Thanks for coming, everyone. I know it's the last session of the conference; hopefully you've had a great time. I'll try to get you out of here a bit early, since I know everyone's tired after a long couple of days, but hopefully you can learn a bit here. So hopefully everyone here is here to learn a bit about PaaSTA and how we do auto scaling at Yelp. If not, there are some other great sessions going on, I'm sure, but hopefully you all stay. A little about me before we get going. My name is Nathan Handler. I'm Nathan Handler on Twitter and handler on almost everything else. I work as a site reliability engineer at Yelp, where I'm one of the main PaaSTA developers and maintainers. I'm also an Ubuntu and Debian developer and a member of the freenode IRC staff. So, the typical recruiting sales pitch: Yelp's mission is to connect people with great local businesses. We had about 97 million unique users on mobile, 115 million reviews since our inception, 74% of users accessing the site via mobile, and we're in 32 countries. With that out of the way: in order to really understand PaaSTA, where we are today, and the decisions that we made, it's important to go over a little of the history of how we used to manage the code base at Yelp. Way back in the day, we had a big monolithic Python application, about three million lines of code, probably similar to how a lot of companies started out. As a result of this, all of our builds and deployments took a really long time. Tests took forever to run. If there were any issues along the way, we had to rerun them, and those issues often meant we could deploy even fewer times per day than we were already doing. This also meant that every mistake was extremely painful. It would have a large impact on most of the developers in the company, because everyone was working on the same code base trying to get out in the same push. Issues were very difficult to find among three million lines of code, and they were slow to fix. We could only do so many deploys a day, so if the morning push had an issue and got canceled, and then in the afternoon push you find out your code caused an issue, maybe your fix can't even get out that day. So we knew we had to find a better solution, and we began moving to a service-oriented architecture. We did what most people do, and what some of the talks at this conference have gone over: we started by splitting features out into different applications. The smaller services allowed us to do much faster pushes. Each service would have its own pipeline and its own push process. They could deploy as many times per day as they wanted without fear of affecting the other services. It also meant it was a lot easier to reason about issues: if we discovered an issue right after a service did a push, it was pretty easy to trace it back. They also had much smaller code bases. And it gave us our first real opportunity to start scaling individual components of the site. When we had the monolith, if we wanted to scale, we could add new servers and deploy the monolith to the additional servers, but we couldn't scale individual components up or down. We were scaling up the whole thing at once, and that just doesn't work out in the long term.
With the service-oriented architecture, each service could run on as many hosts as that particular part needed. I've tossed out the term service a lot, so it's probably worth making sure we're on the same page. A service at Yelp is typically a standalone application. It's stateless, which makes things a lot easier for anyone who's dealt with this before. We isolate them into their own Git repositories, and typically at Yelp they have an HTTP API. They're often Python with Pyramid and uWSGI, and run within a virtual environment. So, deploying services back in this old infrastructure, or I guess our intermediate one, meant a statically defined list of hosts. By that I mean each service had a file that literally just spelled out host name after host name after host name. The operations team was responsible for managing these files. If a service needed to scale up, the operations team would go review what resources were available on our different servers and add or remove host names from these lists as necessary. Monitoring was all very manual through Nagios, and the deployments were pretty manual with a lot of rsync tossed in. And it worked. We were able to scale up the individual components; ops knew every service, knew its resources, knew the limitations. We could quickly tell when we were hitting bottlenecks and adjust as needed. But eventually we had a bit too many services for this to last. It just doesn't scale past a certain point. And that brought us to PaaSTA. PaaSTA is Yelp's platform as a service. It builds, deploys, connects, and monitors all the services we run at Yelp. It's really just some glue around existing and established open source tools such as Apache Mesos, Marathon, and Chronos, but it has a lot of Yelp-specific logic to really tailor them to our use cases. It's also been open sourced; it's available on GitHub, and we have an IRC channel on freenode. So there it is. I mentioned that PaaSTA is glue around a lot of components; here's a diagram showing the workflow. The developer pushes their updated service to Git. Jenkins pulls down the changes and runs them through a pipeline. It then applies a few different tags to the repository noting which version should currently be running, and pushes an updated Docker image to our internal Docker registry. We have Marathon running, which handles all of our long-running services, and it uses the information from Git and the Docker image to run the new version of the service on the appropriate hosts. Then Sensu automatically handles monitoring it and alerting the service authors when necessary. I'll go into a bit more detail about these components, because I know it's a bit overwhelming. A service at Yelp is pretty simple; here's a minimal one. It has a Dockerfile, a Makefile, and a simple PHP script. The status file is just for health-check purposes, and most of this can be set up semi-automatically. The Makefile is pretty minimal here. It takes in a Docker tag, which will be provided by Jenkins; that's how we're able to locate the correct image, and it corresponds to the Git tags. The itest target is the crux of the Makefile. It gets called by Jenkins, which ultimately causes cook-image to get called and a Docker image with the correct tag to be prepared. It's also doing a brief health check here to make sure the created image can be run and will actually come up, to give us a little more confidence before we spin it up.
The Dockerfile: I'm assuming most of you are familiar with Docker, so I'm not going to spend much time here. Using containers allows us to keep our services language-agnostic and version-agnostic, which was a big help when moving away from the monolith, which was running a very old version of Python and stuck on an old version of Ubuntu. Now most of our services are able to run the latest operating systems and the latest versions of languages. So the Dockerfile here: nothing really PaaSTA-specific other than the last couple of lines, where we force Apache to listen on port 8888. That's part of our contract for service discovery purposes; other than that, it's a pretty simple Dockerfile. We have some helper tools built in which will notify you whether your service is set up correctly or whether you're missing certain things. You can see here most of the checks are green; there are a couple of red marks where I left out certain functionality. So at this point, the developer has been able to make a change to their service and check that it's good. The next part has to do with Jenkins. The Jenkins pipeline is all codified: we have a simple deploy.yaml file which specifies how the pipeline should be laid out (a rough sketch follows below). It'll run the itests, push to the Docker registry, and then deploy dev.everything, which is basically everything. dev.everything is what we refer to as a deploy group. It allows us to deploy to different clusters at the same time, or gives us the flexibility to deploy to different clusters in a certain order. So maybe we want to deploy to our dev cluster first, then test it with a larger sample set on a staging cluster, and then finally go to production. The deploy.yaml is where that all gets configured. It produces a Jenkins pipeline such as this one, which shows about five different runs of the pipeline; pretty typical Jenkins. Jenkins will ultimately tag the Git repository, and this is how we control a lot of the actions surrounding the service. You can see here a sample deploy tag, a stop tag, and a start tag. The numbers in the middle are just a timestamp. PaaSTA uses these to determine: should the service be running? Should it be stopped? Was there a deploy at a certain point in time? So this is all tracked in Git. It's very simple, it's distributed, and it worked well. This goes along with the declarative control structure that we have for PaaSTA: we try to describe the end goal, not the path to get there. So we tell it that this particular Git SHA should be deployed to production, rather than telling it to run a deploy of that SHA right now. It's sort of like a gas pedal versus cruise control: we don't tell it how to get there, we tell it what the end state should be. paasta info is then your friend after that. You can see the different endpoints there for SmartStack, which we use for discovery (I'll talk about that in a bit), the Git repo, and some metadata about the service. It's just a quick way to discover information about a service.
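To make that concrete, here is a rough, hypothetical sketch of what a deploy.yaml like the one described above might look like. The exact step names and keys are my assumptions from the talk rather than Yelp's actual schema, so treat this as an illustration, not a reference:

```yaml
# deploy.yaml (illustrative sketch; step names are assumptions)
pipeline:
  - step: itest              # build the image and run the Makefile's itest target
  - step: push-to-registry   # push the tagged image to the internal Docker registry
  - step: dev.everything     # a deploy group: roll the new SHA out to the dev clusters
  - step: prod.everything    # then to production, in the order listed here
```

The ordering of the deploy-group steps is what would give you the dev, then staging, then production rollout described above.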
So at this point Jenkins has run, and Jenkins is responsible for pushing to the Docker registry. The next step is actually getting the service up and running. This is where Mesos and Marathon come into the mix at Yelp. Mesos is essentially an SDK for distributed systems, but the batteries really aren't included: it requires quite a bit of work and configuration to get it to actually do anything useful. To do that, we have two frameworks that we use at Yelp. We run something called Marathon, which keeps all of our long-running services up and running. And we have another framework called Chronos, which is sort of like cron for a cluster, for anyone familiar with cron. We use it for all of our scheduled batches, as well as some ad hoc ones. They're both similar in that they take advantage of Mesos: they handle finding an appropriate host to run the service on, they handle making sure it actually completes, as well as a lot of other aspects that we'll go into in a minute. We also rely heavily on Docker as the task executor. As I mentioned before, Docker has given us flexibility in terms of languages and versions. It also makes it really easy to run multiple versions of the same service on the same physical host, which is something our old v1 architecture wouldn't allow, since we were just specifying host names back then, and there was no way to specify the same host name more than once. The way we specify the Marathon configuration is with another YAML file. This one's pretty minimal: we give it a name for the instance, so this one is just called main, we tell it how many CPUs we require, how much memory, and how many instances it should be running. A simple three-line file like this (sketched below) is enough to get most services up and running. At that point, Marathon does the rest. It handles finding appropriate hosts; there's no longer a need for the operations team to manually specify them. It can run multiple copies on the same host. If we pull a host from service for some reason, or maybe we're using AWS and the host goes away, Marathon will handle doing the smart thing and finding new available hosts to run on. So we've automated a lot of the deployment process, we've removed a lot of the burden from the operations team, and we've given more control to our service authors, but that's really not enough. We found that keeping things dirt simple worked for a lot of services, but we needed more flexibility. So we added a bunch of bounce strategies. Bouncing is how we go from one version of a service to the next, and we have four different approaches. The brutal bounce is the fastest one, but it has no safety built in: it just stops the old version and brings up new instances of the new version, and it doesn't care if it gets into a state where no copies of the service are running. Up-then-down brings up all the new instances of the service, waits for them to start passing health checks, and then begins stopping the old versions. Down-then-up is the opposite: it stops all of the old versions before bringing up new ones. The reason for all of these is that some services are different. You have schema changes, and maybe you can't afford to have any of the old schema running before the new one starts up. But for most of our services, we find that the default crossover bounce has a good balance of safety and speed. It starts up new instances of the service, and as each one comes up healthy, it begins stopping old ones. It's pretty safe: if the new version of your service has an issue that prevents it from starting up, the crossover bounce will keep all the old versions running. So it's another layer of safety we add to all of our services.
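As a rough illustration of that Marathon configuration, a minimal instance definition plus an explicit bounce strategy might look something like the following. The key names here are my best guess at the scheme described in the talk, so check the PaaSTA docs before relying on them:

```yaml
# marathon-<clustername>.yaml (illustrative sketch; exact key names may differ)
main:
  cpus: 0.5        # fraction of a CPU reserved for each instance
  mem: 512         # memory in MB
  instances: 3     # how many copies Marathon should keep running
  bounce_method: crossover   # assumed key; the default bounce described above
```

The bounce_method line would only be needed if you wanted something other than the default crossover behavior.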
So I mentioned that it'll do the right thing: if hosts go away, it'll find new hosts, and if we add new hosts, it'll add them to the pool of available resources. A lot of this is possible thanks to something called SmartStack, which we run at Yelp. We use SmartStack for all of our service discovery. In a simple case, we have a service running on box one. A script will be health-checking the service, and it uses Nerve to register it into ZooKeeper. Then on box two, a client that needs to communicate with it has Synapse running; Synapse checks ZooKeeper and spits out HAProxy configuration files, and the client can then query HAProxy directly to connect to the service. We didn't develop SmartStack, we just utilize it, and the really nice thing about it is that it works great with our hybrid infrastructure. We have some of our own data centers, but we have a lot in AWS, and SmartStack really doesn't care. So we can discover between our physical data center in the San Francisco area and, say, the us-west-1 region in AWS. This brings us to latency zones, which is how we group everything at Yelp. A superregion is how we handle most things: it's a grouping such as norcal-prod, which would include both the us-west-1 region in AWS and our physical data center. We also have regions, which let us specify that maybe we only want to run in the physical data center and not in AWS for some reason. PaaSTA supports services specifying the exact discover and advertise settings they need to make this work. It also supports auto scaling at the appropriate level: if we need to make sure we have a certain number of instances running in our physical data center, that works, or maybe we don't care and we just need to auto scale across the physical data center and AWS together. All of this works out of the box. As I mentioned, there's another YAML file that the service authors can configure. This one is choosing to advertise at the region level and discover at the region level; a region would be an AWS region or one of our physical regions. The proxy port is assigned semi-automatically to all of our services and just gives us a unique way to reference a service in SmartStack terms: when we query HAProxy on a host, we use the proxy port.
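For reference, the discovery configuration just described might look roughly like this. The key names are my assumption based on the talk, not necessarily Yelp's exact schema, and the port is a made-up placeholder:

```yaml
# smartstack.yaml (illustrative sketch; exact key names may differ)
main:
  advertise: [region]   # register this instance at the region level
  discover: region      # clients look instances up at the region level
  proxy_port: 20973     # hypothetical port; the local HAProxy endpoint clients connect to
```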
So that brings us to auto scaling. Up until now, PaaSTA was working just fine. As we added new hosts, they would be available in the pool of resources that services could utilize, but service authors still had to manually adjust the number of instances they were specifying in their Marathon configuration file. That meant they had to constantly be monitoring: during peak time they would have to scale up, and during non-peak time they might scale down to try to save us some money. The bulk of our service authors deemed this too big of a hassle, so they chose to just over-provision all of their services. They would run with enough resources to handle peak time and keep it that way all of the time. And that works: you're throwing money at the problem, but it does work. We were also forced to do the same thing at the cluster level. We had to make sure all of our clusters had enough hosts available to deal with the services running at this over-provisioned level, so we were paying for a ton of extra servers that we really didn't need most of the time. With auto scaling, we were able to eliminate the need for the service authors to manually adjust the numbers. We were able to get rid of the over-provisioning and just run with the number of service instances and hosts that we actually needed at that particular time. This saved money on the bills, and it increased reliability, because sometimes the service authors just weren't paying attention: we'd hit a peak moment and their service wasn't scaled up enough to handle the extra demand. I mentioned that service authors were responsible for monitoring their service, but they really don't have a ton of insight. We give them paasta status, which gives them health-check information at the various levels; they can see how many instances are running, but they don't get a lot of information. So we set out to make all this a bit easier. We wanted auto scaling, like PaaSTA from the start, to be a very easy and straightforward process. So we thought: what would feel most natural to a service author? Our first guess was a magic wand they could wave and just have everything work: it would guess the right numbers, do the right thing, and have all the safety checks in place. Eventually we decided that just wasn't practical. So instead, we changed from specifying a number of instances to specifying a minimum number of instances and a maximum number of instances. It's pretty clear when reading through what this means, and it's easy for our service authors to understand: instead of running with a fixed three instances of the service, PaaSTA now handles scaling between three and five instances. It uses a lot of the default logic we have on the back end, but for most of our services this works fine: it'll scale up to a max of five and down to a minimum of three. As always, there are people who want more flexibility, and they wanted to be able to control under what circumstances we add or remove an instance of their service. So we had to add a few more toggles to make those service authors happy. We added a bunch of metrics providers. Metrics providers allow PaaSTA to determine how utilized a service is. Our default one is something we call Mesos CPU. It uses CPU utilization to predict how overloaded or underloaded a service is. CPU tends to be one of the main bottlenecks for our services at Yelp, so for most services this works fine. For services that need a bit more flexibility, we have an HTTP metrics provider, where PaaSTA just makes a REST request to a particular endpoint. That endpoint returns some JSON with a float between zero and one indicating how utilized the service is. Authors can base that on whatever data they want; PaaSTA doesn't care. And then the final one: we use uWSGI for a number of our services at Yelp, so this one uses the percentage of non-idle workers to determine whether to scale the service up or down, using the basic status endpoints that uWSGI provides. So the metrics provider tells us, between zero and one, how utilized the service is. But at some point we need to make a decision: does a value of 0.5 mean we should scale up, scale down, or stay the same? So we implemented a few decision policies. The first and default one is a PID decision policy: it uses a PID controller to determine when to auto-scale a service. It basically keeps checking, in the default case the CPU, and keeps adjusting until we settle at a healthy level.
A simpler approach is the threshold policy, where you just draw a line and say: if the utilization is above 0.5, scale up; if it's below that value, scale down. It's pretty simple and clear. And then finally we added a bespoke decision policy, which allows the service authors to implement their own policy. They basically tell us, through JSON, how many instances we should be running, so they get full control over the auto-scaling of their service. They can base it on whatever data they want. They can base it on the time of day. They can base it on how many pizzas they had that morning. We don't care. The way that looks in the Marathon configuration file: you still have the min and max number of instances that we saw before, but you add an additional autoscaling dictionary at the bottom (a rough sketch follows below). In this case we're using the HTTP metrics provider, querying the metrics.json endpoint, and we set the threshold to 0.5. So if the utilization is above 0.5, we'll scale up, and we'll continue to do so until either we hit the max number of instances or the utilization drops below 0.5. As I said before, if you don't specify the dictionary, the defaults are the Mesos CPU metrics provider with the PID decision policy, and it'll do a pretty graceful job of scaling if you don't specify anything. Here's a sample of how this actually looked; we were really surprised at how effective it was when we first rolled it out. It's a little hard to read, but the top graph shows CPU usage in terms of the number of cores, and the bottom graph shows the number of instances running for the service. You can see that as the CPU usage increased, the number of instances of the service also increased, and then as the CPU usage decreased and leveled off, the number of copies of the service leveled off too. So instead of always having to run with the max number of instances and waste a bunch of money, we can now let it fluctuate anywhere between the minimum and the max and conserve resources a bit better.
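Putting those pieces together, the Marathon config with the autoscaling dictionary described above might look roughly like this. The exact key names (setpoint, decision_policy, and so on) are my assumptions from the talk, so verify them against the PaaSTA docs:

```yaml
# marathon-<clustername>.yaml with autoscaling (illustrative sketch)
main:
  cpus: 0.5
  mem: 512
  min_instances: 3          # never scale below this
  max_instances: 5          # never scale above this
  autoscaling:
    metrics_provider: http  # poll the service's own utilization endpoint
    decision_policy: threshold
    setpoint: 0.5           # scale up above 0.5 utilization, down below it
```

Leaving the autoscaling block out entirely would fall back to the defaults described above: the Mesos CPU metrics provider with the PID decision policy.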
All of that is just at the service level: we'll spin up additional copies of individual services and tear them down as needed. But we still have the issue of what happens when you use up all of the available hosts and no longer have resources to launch new copies of a service. Without cluster auto scaling, new instances would simply not be able to launch. They'd end up in what's called a waiting state, where they keep checking whether there are available resources; the minute some free up they launch, but until then they're stuck waiting. So we quickly had to roll out a cluster autoscaler. The cluster autoscaler works a little differently. It examines all the resources tracked by PaaSTA, currently CPU, memory, and disk space, and looks at whichever value is worst. So if we're using 50% of memory, 70% of disk space, and 80% of CPU, the 80% CPU utilization is what's used to determine whether we need to scale up or down. Based on that, it checks whether we've crossed a certain threshold, similar to the threshold decision policy I mentioned before. Usually we aim for about 80% utilization across our cluster; we find that gives us enough overhead to quickly scale up services as needed without wasting too much money. The cluster autoscaler also supports scaling individual pools. A pool is a set of hosts devoted to one or more particular services. Usually we do this to have finer-grained control over the instance types they run on, or to make sure they always have their own hosts, or that they only run alongside certain other services. There are a number of reasons, but we can auto scale each pool independently of the others. So maybe we need to scale up us-west-1, but us-east-1 is over-provisioned and needs to be scaled down; the autoscaler will handle that. The target utilization I mentioned before is also fully configurable, so we've been slowly tweaking it over time as we continue to add new services and grow our clusters. And finally, the autoscaler supports both auto scaling groups and spot fleet requests in AWS, which has proven really useful for us. We were using ASGs for a long time, and eventually we decided to evaluate spot fleet requests to see if we could save some money. We were a bit worried at first that we'd just start losing instances and have our entire cluster vanish, but they've proven to be pretty stable, at least stable enough to keep a service up and running and get it to a healthy point. And when we do lose some instances, we still have enough others remaining that we're able to continue to operate. A lot of this is due to careful tweaking of how much we bid on the instances, but by using spot fleet requests we've been able to sacrifice a little reliability to save quite a bit of money. We continue to use ASGs for our more mission-critical stuff, but we're constantly evaluating. So, I mentioned we have a paasta status command. We also have a paasta metastatus command, which lets us see the health of an entire cluster. Here we can see that we're using just one of the seven available CPUs in the cluster (it's a pretty small cluster), just three of the 42 gigabytes of memory, and 10 of the 153 gigabytes of disk. You can see the percentages, and the colors correspond to health checks going on in the background. The paasta metastatus command is used behind the scenes by the cluster autoscaler: it looks at these percentages, and that's what it uses to determine whether or not it needs to scale up or down. And then, I mentioned that we use AWS quite a bit. We're not huge fans of the web consoles for most of our stuff; Marathon has one, Mesos has one, but we prefer to configure things via code, similar to those YAML files I showed earlier. We do a similar thing for all of our AWS resources: we use HashiCorp's Terraform tool to configure them. It allows us to model all of the resources as code. This is actually the code we would use to spin up a new pool in AWS. This one uses spot fleet requests; it would look very similar to spin up one using an ASG. The advantage of having it in a configuration file is that we all have access to it, it's all backed up in our Git repositories, and we have a history of all the changes that have been made to it. So if I decide I need to revert a change, I can do it quickly. And finally, Terraform gives you a really nice terraform plan command, which shows you exactly what changes need to be made in production to make production match what you have in the configuration files. So if I were adding this new spot fleet cluster, I would see a diff-like output showing the change to add the new cluster.
As far as auto scaling is concerned, it's very similar to what we did at the service level. We specify a min and a max value, and then we give it an initial target capacity. The initial target capacity is really only used for bootstrapping a new cluster; outside of that, the min and max do most of the work of constraining it. We found that the min value actually came in handy as a safety check in a lot of places. While developing the autoscalers, both at the service and the cluster level, we had a number of issues along the way, and setting a high enough min capacity, so that even if we dropped to that level for all of our services and all of our clusters we could keep the site up and running, gave us a level of safety that allowed us to develop this in a relatively short period of time. In addition to creating the actual infrastructure, the Terraform code also uploads a bit of configuration to Amazon's S3 service. This configuration then gets used behind the scenes: we have a cron job that runs approximately every 20 minutes, which examines the configuration file to see what the expected min and max capacities are, examines the current utilization of the cluster, checks whether we've crossed the threshold and whether we're still within the range, and then adds or removes instances as necessary. We can usually spin up new instances pretty quickly. We have a PaaSTA AMI that we created, which is outside the scope of this talk, but it has all the configuration necessary, so within maybe about five minutes we can add a new host. That's part of the reason we run with an over-provisioned cluster: it does take a bit of time to add new hosts, and it takes a bit of time for services to get deployed there, so we want to give ourselves a bit of headroom. In terms of safety, scaling up has always been a pretty safe and easy process. You just add new hosts, or you add new copies of the service. SmartStack handles the discovery for the service, and Mesos handles realizing that there are additional resources on the new hosts. Scaling down, on the other hand, is always a bit scary. I mentioned how we set the min instances and the min capacity to a pretty high level on our actual clusters. That's because we have had a number of incidents where PaaSTA tried to scale down to a very low value, either because of bugs in Mesos or Marathon or bugs that we committed. So we have a lot of safety checks like that. We also have checks behind the scenes to make sure that we're not removing too many hosts or copies of a service at once. So if the autoscaler thinks it suddenly needs to remove 75% of the instances of a running service, or whatever threshold we set, it sets off a few red flags internally and prevents it from happening. We have similar checks at the cluster level. But one of the key parts we worked on was something called PaaSTA maintenance. It utilizes the Mesos maintenance primitives, which they rolled out last year. While we would love to have them just work out of the box, neither of our frameworks actually supports them. There's an issue on GitHub against Marathon, and I think there might be one against Chronos requesting it, but it really hasn't seen any attention. So we had to develop all the tooling on our own.
The way we do this: if we decide we need to get rid of a host, whether because Amazon gave us a two-minute warning that a host is about to go away, whether we're doing manual maintenance, or maybe Puppet is running and needs to restart Mesos for some reason and take everything down, we'll initiate maintenance mode. By doing this, we schedule the maintenance for a certain period of time in the future. So we might say maintenance is going to start 10 minutes from now. For the next 10 minutes, PaaSTA will attempt to gracefully drain the host. We do this by scaling up the number of running instances of the service. Say the host we're draining had two copies of the service running on it: we'll scale the service up by two to hopefully force those copies to start running on a different agent. At the same time, we use Mesos dynamic reservations to tell Mesos that there aren't any free resources left on the host we're draining, because otherwise Mesos would attempt to reschedule the service on the draining host and prevent us from ever fully draining it. Once we either fully drain the host or hit the start of the maintenance window, which we said was 10 minutes in the future, we initiate the actual start of maintenance, which as far as Mesos knows means we're taking the host down. It stops the Mesos agent process, nothing continues to run, and we can terminate the host because nothing is running there anymore. It's sort of a best-effort draining process: it's rather slow, and if we don't have enough additional resources available, or if we're trying to drain too many hosts at once, we'll sometimes get stuck. But it has helped us drain things gracefully, or at least more gracefully than before. We also have SmartStack removing everything from the load balancer before we bring the service down, so no new requests get routed to the service and hopefully there isn't a user interruption. So at this point, just to recap: we've gone over how a developer makes changes to their service and pushes them up to Git, where Jenkins begins processing them. Jenkins adds a few Git tags and uploads an image to the Docker registry. Mesos and Marathon then do their job of finding available resources, deploying the service, and keeping it running. The autoscaler works behind the scenes to make sure we have the appropriate number of instances running for all of our services. But no system is really complete until it's monitored, and as I mentioned before, we were using Nagios with manual ops intervention. We decided to move to something called Sensu for monitoring. This is all configured in one more YAML file, monitoring.yaml (sketched below). We can define a team name, which maps to a bunch of notification preferences and information about how to page that team's on-call person. We have a boolean here specifying whether to page or not; some of our services are just internal toys, and no one really wants to get woken up at two in the morning for something that's not critical. And here I override the notification settings to have it email me. Out of the box, this gives us a lot of useful monitoring. It gives replication alerts for all of our services, which check whether we have the correct number of instances running, and if not, it'll page someone.
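A rough sketch of that monitoring.yaml, with assumed key names and a placeholder address, might look like this:

```yaml
# monitoring.yaml (illustrative sketch; key names are assumptions)
team: my_team                             # maps to the team's notification and on-call preferences
page: false                               # internal-only service: alert, but don't wake anyone up
notification_email: someone@example.com   # hypothetical override to email instead of paging
```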
So when we mess up autoscaling, or there's a bug, we usually have a lot of service authors complaining about getting paged for something like this, but it's proven to be a very useful notification tool. It quickly alerts service authors that there's an issue and that action needs to be taken, and they didn't need to do anything to set it up. So one question a lot of you are probably asking at this point is: why didn't we just go with AWS for all of this, or whatever your favorite scaling utility is? The big reason, ultimately, is that we're a hybrid infrastructure. We have physical data centers as well as stuff in AWS, and an all-Amazon solution wouldn't work for our physical data centers, so we would have to run two auto scaling solutions if we wanted to use one of theirs. We also get a bit more flexibility and more advanced decision policies. Amazon typically operates on a threshold that you can configure, whereas we allow each of our services to set their own options or even implement a policy completely from scratch. Finally, we get much better integration with PaaSTA and Mesos. At the auto scaling level, we're aware of how many services are running on a particular host, so we can use that information to scale down hosts that really aren't running much when the cluster autoscaler needs to act, and we can do more useful notifications and more safety checks. Really, having full control over all aspects of running the service, monitoring it, and scaling it has proven to be very useful, and it gives us a better understanding, so when things go wrong we can actually do something about it. Whereas when S3 was down the other day (sorry if anyone from Amazon is here, to bash them once again), there really wasn't a lot that most companies could do. Some had options, but most didn't. Since we control everything, even if all of Amazon were to go down, we still have physical data centers; we could adjust the SmartStack configuration so that our services run out of our physical data centers, and we can keep the site up and running. So, I promised I would get you out of here a bit early. Here's some more information about Yelp if anyone's interested, and I'd be happy to answer any questions anyone might have. So thank you. Yeah, so the autoscaler will initiate the maintenance calls there. The question was: is it registered as a framework? The only frameworks we actually have in our PaaSTA cluster are Marathon and Chronos, so we're just using the standard REST API to make all those requests. We trust our developers, typically, so we don't really have an issue with that. Any other questions? Yes. The question was about a split-brain situation, with Mesos not being respected by Marathon. We've had a number of weird issues along the way. The typical issue we've had in recent versions of Mesos and Marathon is that our services will end up listed in ZooKeeper but Marathon isn't aware of them, or Marathon is aware of something that Mesos isn't. We don't really have good solutions for that. A restart will typically fix most of those issues, but not the split-brain situation in particular. I don't have a good answer for you, unfortunately. Yep. So the question was whether we've thought about making some of the lower levels pluggable, or having a Kubernetes option. We actually did a review of Kubernetes recently, and to be fully honest, it's an awesome product, and it's gained a lot of new features that make it a really cool tool.
If we were starting from scratch, we would probably use Kubernetes over Mesos. The big issue is that at this point we've made pretty big investments in Mesos. We've tailored it a lot to our specific use cases and workflows, so trying to change to Kubernetes would be a costly transition, and a lot of the benefits would be lost due to all the extra tooling that we've already built and gotten working. But we have considered it. The question was whether we use Terraform for anything outside of AWS. We don't use it to manage any of our physical servers; we still image those manually and have Puppet configure them. We do use Terraform for DNS, CDN, and a few other things, and we're pretty frequently adding new providers. Basically, anything that would typically require us to go into a web console we'll often add to Terraform. The question was how we pick the min and max values for the cluster. That's constantly changing. We're still in a pretty big transition of moving some of the last stragglers over to PaaSTA, and as some of them move over, they drastically affect the number of instances we have to run. Now that we've gained quite a bit of trust in our autoscaler, we'll bump the max capacity value up pretty high, so we'll allow the autoscaler to potentially cost us a small fortune if it decides to scale up. With the min capacity, we're a bit more conservative: we try to make sure that if we were to drop to the min capacity, we could still keep basically all the crucial stuff running. But a lot of it is just manual tweaking over time. As we see new big services being ported over to PaaSTA, we'll manually check, or if we see services get stuck in a waiting state in the Marathon dashboard, that's a pretty clear sign that we have to scale up, but it's all manual. The question was whether we use an AMI for cluster auto scaling in AWS, and we do use one. It was necessary to cut down a lot on the amount of time it takes Puppet to get a box up and running; there's still a bit of time needed to get it into a fully functional state. And he asked if we use cluster auto scaling in our physical data centers. Currently we don't. We're at capacity in most of our physical data centers, and we're trying to move more and more to AWS, so we're really not trying to dynamically grow them. As we get new servers that are being repurposed from other teams and converted to PaaSTA agents, we'll add them to the pool, but we don't do any cluster auto scaling there. So, for persistent data, services can connect to things like our remote MySQL databases. I meant more within the running container: if they want to write out to disk or anything like that, it's going to get lost the minute we spin up a new instance, so they need to use one of our remote data stores. We currently don't use any of the volume functionality that Mesos can provide. We've looked into it, but currently we don't offer any support for that. The question was whether scale-down considers Amazon's one-hour bracketed charging. In PaaSTA, no. We have some other tools at Yelp that do take that into account. It's something I imagine we'll likely look into in the future, but currently it doesn't, especially since most of our instances are reserved instances running for a long period of time. But yes, the question was whether we ever run into a situation where, based on the scaling metric, we scale up and scale down, scale up and scale down, and constantly end up doing that. And we have had that happen.
Some of our services have really weird usage patterns, and if they don't set their scaling preferences correctly, we'll run into issues. Usually we'll try to work with the service authors when we notice something like that happening. And usually, as long as the min instances value is set high enough that they can continue to operate, then other than the additional money we're losing and some additional cluster overhead, their service should continue to operate fine. We don't scale down so frequently that we're scaling up and trying to get rid of nodes before they're healthy, or anything like that. So other than the cost, there really isn't an issue, and we'll work with the service authors to fix it. Yeah, that's what he was asking. We currently don't factor that in, although I imagine we will at some point in the future. To prevent the runaway scaling up and down: yeah, that's part of PaaSTA. A lot of that is just additional thresholds that we set behind the scenes, saying don't ever scale down by more than, say, 75%. A lot of that is also just heuristics; there's no research going into those particular values. It's just an additional safety guard, but other than in some of the early days, I don't think we've actually hit those sorts of situations much recently or had to fall back on them. It just gives us a bit more confidence to roll out this functionality. A lot of that is thanks to SmartStack. We have some tooling around it, but it basically no longer allows new clients to connect to the service, and gives the existing ones a period to finish up the requests they're serving. Anything else? Yeah, yeah. They're both in PaaSTA. There are some helper scripts to actually call them, which are usually cron jobs running on the masters, but the actual code is in PaaSTA. Anything else? Yeah. The question was whether we mix spot request types; I'm assuming we mean instance types. Yeah, we request a wide range and we diversify across it. We also did some math on the bid values that we use, to try to avoid losing them. Well, thank you everyone for coming.