Hey everyone, how are y'all doing? I know the last couple of sessions of KubeCon are winding down, so I'll give everyone a couple of seconds, since some people are still flooding in from the hallways. This talk is "Cluster in a Box"; that's the title on the schedule. We'll be going over different ways to utilize different types of containers to deploy different kinds of Kubernetes architectures. With that said, I guess we'll get started; I have to catch a flight very shortly after this as well.

My name is Rye Terrell. I'm a software engineer at Canonical, the company behind Ubuntu, and I work on the Canonical Distribution of Kubernetes.

Cool, and my name is Marco Ceppi. Amongst other things, I do operations at The Silph Road. How many people here know what Ubuntu is? I imagine most of you do, so I don't think we have to give it much of an introduction. Who here has ever heard of The Silph Road? Yeah, just like two people; that's okay. I imagine most of you actually know the community that this group caters to. Does anyone here remember Pokémon GO? Remember that thing last summer, the huge craze? The media stopped reporting about it, but there's still a ton of active people; I promise they're out there. The Silph Road is a site built to support that community, so if you've ever seen hordes of people wandering around, odds are we're supplying the data they're wandering towards.

We're the premier grassroots Pokémon GO community, which sounds spectacular for those of you who know what Pokémon GO is. You're probably more familiar with other titles from Niantic: they produced Ingress, and they have a bunch of other very interesting titles as well. It's an entirely volunteer-run and operated community website. Myself and the others who lead the site all have full-time jobs outside of this. We don't have an income plan; we're not here to generate revenue. We're just here to make sure the site stays afloat without having to pay for it out of pocket.

We get a ton of hits to this thing. You'd be surprised how popular it is and how our traffic keeps growing, even after the initial spike and lull. These numbers are a bit old now, but we're doing well over 30 million hits and tons of uniques every month. And because we're volunteer-run and operated, we have a very interesting development cycle where we leverage people from the community. At any one time it could be just me or someone else, up to most recently, when we ramped up for an in-game release that's happening literally right now.
I'm the operations guy giving a talk, but somewhere behind the scenes there's a bunch of code being deployed, and I'm just going to pretend that's okay. We've scaled up to 55, even 60 developers, all volunteers from around the world. We have a giant volunteer organization for leadership and community management, and there's a 1,000-plus-member research group that gets together to collect data and work out scientifically how the game's mechanics work. All these things are great, but they lead to a huge nightmare: how do we keep running a really smooth website so we can stay that premier grassroots Pokémon GO community, and how do we keep enticing volunteers to support and supply these things? It was obvious containers would be a way for us to expedite that development workflow while keeping sanity for those of us volunteering our time.

Before I start talking about Linux containers, machine containers, Docker, and everything else, I want to talk really briefly about the kinds of containers that exist in the ecosystem today. For a lot of you this may be a review; for some of you it may be new. We'll walk through why these different types of containers are important and where we've found ways to best leverage them.

This is a server. It's a machine: it could be your laptop, a piece of bare metal, or a cloud instance or VM running somewhere. For the most part it's a machine in the most traditional sense. You installed an operating system on it, whether that's Ubuntu or Slackware or Arch Linux or whatever it happens to be. You've got a bunch of processes running, things like init, cron, an SSH daemon, and loggers, the things we've come to know and love as core operating system processes. It's got networking, this kind of cute orange dot down here, and it's got disks where you can store and retrieve data.

From here you can slice this up in a bunch of different ways. The first, which we're probably most familiar with, is virtual machines. VMs have been a staple for people from laptops through to servers; most famously, all public clouds are basically just virtual machines running on big giant servers. And virtual machines are just that: virtualization of I/O. It's emulation of hardware, emulation of firmware, emulation of all these bits, carving out physical resources from a machine so they appear as a smaller chunk of isolated, running operating system. And it is a full operating system: an init process, cron, loggers, networking, disks, the things we've come to love.
Then we talk about things like Docker containers. I imagine most people here are pretty familiar with Docker, runc, rkt, the gamut of things that provide you with a Docker-like experience: a way to manage and run processes in confined fashions. The most noticeable difference between process containers and virtual machines, as a quick review, is that virtual machines run all those supporting processes: an init process, SSH, loggers, crons, daemons, and so on. Inside a Docker container you're just running the process you care about, plus the dependencies required to bootstrap that process. You're not running a kernel, you're not running all of those additional ancillary support services; it's just a process inside some form of disk. I like to draw that inverted, because you can write to it but you can't really persist it; you can commit snapshots, but these are really meant to be ephemeral. You can attach networking, and most importantly, you can run these really densely. The number of Docker containers you can run per host runs into the thousands, compared to the virtual machines you could spin up on that same host for the same workload, because you're not physically carving out resources and virtualizing hardware; you're simply confining and constraining a process.

The last container type is the machine container. This is the natural hybrid of a virtual machine and a process container. You've got a machine: it's running processes, init, and so on; it's got disk, it's got networking, it's got your application. It's everything you've come to know and love from the last 20 years of Linux system administration, but you're able to run these far more densely than virtual machines, at close to process-container density, because it's leveraging the same primitives that Docker containers do. It's not doing virtualization of I/O; it's simply using the same kernel primitives Docker does for process isolation. But it gives you a machine context: a disk you can write to and persist, networking, namespaces, and an init process. You can SSH in and manage it just like you would a normal machine. That's the kind of spread we see from containers today: virtual machines, machine containers, and process containers.
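To make that contrast concrete, here is a minimal sketch; the image choice and the container name (web01) are just illustrative, not anything from our setup. A Docker container confines a single process, while an LXD machine container boots a full userland you manage like a small server:

```bash
# Process container: one confined process, ephemeral by default.
docker run --rm nginx

# Machine container: a full OS userland with init, SSH, a persistent
# disk, and its own network identity, managed like a VM.
lxc launch ubuntu:16.04 web01
lxc exec web01 -- systemctl status ssh
```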
On the machine container side we're talking about LXD in this case, but you could also count things like the Clear Containers project, which, while technically not a container, is a very similar idea for running really lightweight isolated machines. And then you have Docker, runc, rkt, OCID, the gamut of different container technologies, and that's how the space plays out. You wouldn't necessarily run an init process inside a process container like Docker, although you could; it's expected to run, and really only belongs, inside things like machine containers.

That's where we started. We said, okay, Docker is going to be a great thing for us, but all of our developers, all of our volunteers, have different environments. We've got people from literally around the world, on Macs, on Arch, on Windows, on FreeBSD, on every platform you can imagine, and presenting them with a standard development environment, saying here's what production looks like, this operating system, these characteristics, this software installed, was really difficult.

So we turned to LXC as our first step. To us, LXC was the OG Linux container, and I say Linux container because it's very similar to things like Solaris Zones and FreeBSD jails, which give you that same idea of an isolated machine; but this was for Linux, and that's what we were using. We found out that Heroku, the PaaS I'm sure you're all familiar with, built their entire infrastructure, their entire PaaS offering, on top of LXC. That gave us a nice comfort layer of knowing this wasn't going to be some spiraling security issue or some compatibility problem. We also found out that Pivotal's Cloud Foundry uses LXC under the hood as a way to confine and constrain workloads for their PaaS as well.

So that's where we started, with LXC, and I'd like Rye to walk through real quickly what LXC looks like and how you interact with it.

Right, what have we got here? If I run lxc list, I get a list of my currently running containers; I have none right now. I can launch a container like this, lxc launch; I'll launch an Ubuntu Xenial container, which is 16.04, and in just a moment that'll come up. So what's it doing right now? It's fetching the Ubuntu image, I guess. That's right. So this container is effectively what we've been talking about: it boots up and uses the same host kernel. It's basically like how you'd normally do a docker pull and a docker run kind of scenario, except it's pulling an OS image down instead. Is that right? That's right, yeah, cool.
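For reference, the commands being typed in this part of the demo amount to roughly the following; this is a reconstruction of what's on screen, not a copy of it:

```bash
# Show the currently running containers (none on a fresh host).
lxc list

# Launch an Ubuntu 16.04 (Xenial) container. With no name given,
# LXD generates one. The image is fetched from the ubuntu: remote
# and cached on first use, so later launches take only seconds.
lxc launch ubuntu:xenial
```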
Please, please ask questions. Yeah? Your question: what about the kernel? So this is an Ubuntu Xenial host machine, which is super convenient for us in this case; it's really compatible. But what if we were to launch, say, a CentOS 6 machine, or a Fedora machine? Does anyone know what the current CentOS 7 kernel is? Is it less than 4.4? Maybe 3.10? There you go. Well, to be honest, we tested this; we literally spent all day testing these demos to make sure they work, and we'd already cached the Ubuntu images, so I wasn't expecting it to download again, or we'd have been well into your answer already, because it was definitely part of the demo. Because you're right: they're sharing the same host kernel. So what happens with kernel mismatches? It turns out it's actually not that bad. Most kernels going backwards are fine: as long as you have a new enough host kernel, running a userland that expects an older kernel tends not to break. It's when you run a super old host kernel and the userland inside the containers expects newer kernel features that you hit trouble. We like using Ubuntu, and we may be a bit biased, but Ubuntu tends to ship very recent kernels; even on Xenial we can install a 4.13 kernel if we really want to, and I'm looking forward to the next long-term support release. That's why we use it as a host OS. But it's not limited to this; LXC is a tool that can run on really any modern kernel, with 4.0-plus being the sweet spot for these things.

So, 81 percent later... yeah, another question: where do we download that image from? That's a great question. This is LXD installed on top of an Ubuntu machine, and each distribution that installs LXC can also install a preset list of remote image repositories. There's an image repository called ubuntu: which points to a remote source with a list of all the Ubuntu images. If this weren't still downloading I'd have already answered your follow-ups too: how do you see the list of images, where do they live, how do you hook into them? Those are fantastic questions; it means I'm on the right track from a talk perspective. All right, any other questions while we wait for the last 0.0 percent?

Hey, that's it! So Rye, please show me what we have now. Okay, if we take a look at lxc list again, we'll see that we have a persistent machine with an IPv4 address. Cool. Could you launch me another machine, dare I ask? Sure. So you used x this time; do you want to explain why you didn't type xenial? Is it that long? Yeah, sorry, it's too long; there's a shorthand in the Ubuntu namespace I like to use. Okay, cool. That launched a lot faster: we have the image cached, so it takes mere seconds at most. And it looks like humane-jawfish is now running as a container; it's got an IP address and everything.

Can you launch me another container? Sure. But instead of Ubuntu, can you give me CentOS? All right. So hopefully this won't download the CentOS image, but if it does... look at how fast that was. Man, pre-baking some of these things is wondrous. So show me the containers we have. We should now have an exciting-piranha. A very exciting piranha.

Can you prove that's CentOS, I guess, is my next question, just to make sure there's no doubt in the audience. Is that actually a CentOS machine? Prove it to me. Sure, I can do that. I can exec into the machine context and we can take a look at it. I'm guessing this is kind of like how you can docker exec into things? Yes, very similar. So I can do a yum update. Can you cat /etc/issue, and cat /etc/redhat-release as well? Sure. That'll give us what we want. And then finally, to answer the last question, can you run a uname? Yeah, absolutely. So this is the entire CentOS userland, everything you've come to know and love and enjoy from CentOS, but it's running on a 4.4 Linux kernel from the Ubuntu host. Which is interesting, actually, but we haven't had any problems at all running different types of images.
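The CentOS part of the demo amounts to something like this; exciting-piranha is the name LXD happened to generate on stage, so substitute whatever lxc list shows you:

```bash
# Launch a CentOS 7 container from the public images: remote.
lxc launch images:centos/7

# Prove it's a real CentOS userland...
lxc exec exciting-piranha -- cat /etc/redhat-release
lxc exec exciting-piranha -- yum update

# ...running on the host's 4.4 Ubuntu kernel.
lxc exec exciting-piranha -- uname -r
```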
Right, can you show us all the images that come available by default? Yes. We've launched CentOS, we've launched Ubuntu; what else can we launch out of the gate? Is this more legible? Oh yeah, that's super legible now. For those of you in the back, or those in the front squinting like me, we've got, let's see: Alpine. What else is there? Arch Linux; CentOS 6 and 7; all the Debians you can imagine, Sid, Jessie, Wheezy, Stretch; Fedora; openSUSE; Gentoo; Oracle Linux; the list goes on down to Ubuntu and beyond. LXC has a default image repository as part of the Linux Containers project, so they ship a bunch of these images. That being said, you can create your own images and put them into your own image repository, just like you might with a Docker registry, and distribute starting OSes or even golden images on top of that. So that's super cool.

This is how we started: we used LXC, and then ultimately we started using LXD. LXD is nothing different; it's just a hypervisor for LXC. Before, LXC was really burdensome: you had to go run all these long commands. LXD made it super hypervisor-like: creating containers, starting and stopping them, migrating them, snapshotting them, everything you'd want to do with a VM, except at the machine container layer. It's just a RESTful API with a local client, or you can drive it over the network. We integrated it into our pipeline, so we can onboard a new developer by creating them a new container that they can SSH into, and then they have the toolchains we use, the environment, everything we basically have in production, for them to play, break, rinse, and repeat. That ultimately led to better velocity for our developers: they have everything they expect and everything they need in order to start contributing and being effective as volunteers. Their time is precious to us, so the less friction we have, the more we get from contributions.
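If you want to poke at the same pieces, these are the stock LXD commands for browsing image remotes and for publishing a hand-built container as a reusable golden image; the container and alias names here (dev-base, silph-dev-env) are made up for illustration:

```bash
# Image remotes configured out of the box (ubuntu:, images:, ...).
lxc remote list

# Browse what the public images: remote ships.
lxc image list images:

# Publish a stopped, pre-configured container as a golden image,
# then launch fresh developer environments from it.
lxc publish dev-base --alias silph-dev-env
lxc launch silph-dev-env new-volunteer-box
```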
So, on to the cluster in a box, switching tracks a little bit. Rye and I had the pleasure of working with a really small team of people at the Linux Foundation to help design the Certified Kubernetes Administrator exam. Has anyone here taken the CKA exam? Yeah? Wasn't it brutal? I'm sorry. It's a really tough exam, but it's a great exam, and I recommend everyone here go take it. It's super tough, but it covers the gamut of things you'd ever need to know in order to administer a Kubernetes cluster, and the certification holds quite a lot of weight if you're looking for jobs or shopping around. It's the CNCF and the Linux Foundation that produce and run the exam.

We had a bit of a problem when we were designing it, in that this is what you expect today when you run a cluster: a bunch of machines, probably in a cloud somewhere, etcd, masters, workers, however you do your architecture, whether you're self-hosting or keeping components separate; you've got VMs running. So when we were working with the Linux Foundation team and the CNCF on the exam design, we said, okay, we can't use just a single cluster, because if the user messes up on a previous question, we don't want that to impact the rest of the exam; one wrong question shouldn't basically disqualify you from the rest of it. So we said we were going to need at least six clusters; in fact, we ended up with something like eight or nine at the end of the day, and we designed all the questions to make sure we had properly isolated all of these pieces. The Linux Foundation and the CNCF came back and said: that's great; however, this exam costs three hundred dollars. At that price point, in order to run a five or six hour exam, with the setup, burn-in, and review times on top, that infrastructure would cost well over the exam price.

So we went back to the drawing board, and I remembered a talk from the last KubeCon, in Berlin, given by Lin Sun, who was formerly at IBM; I think she's now working on Istio. She talked about using LXC containers as a way to colocate multiple clusters on a single machine. That was very fortuitous, because we were working on this project over the summer, so we started leveraging a lot of those primitives to deploy multiple clusters in one machine. The Linux Foundation and the CNCF said: you have one VM, this is the size, we're going to run it in a cloud instance, and that's it; make it work inside of there. And so we did. We set up an LXC cluster and deployed all the components you've come to know and love. They're all isolated from each other, they all have IP addresses, you can connect to them, they're all networked up. It was effectively what we showed in the earlier diagram, only on a single machine. Then we scaled it up: we said, let's create a whole bunch of clusters, let's thrash the I/O on this machine and see how far we can stretch it. At the end of the day we were able to deploy quite a lot of clusters side by side, all isolated and independent of each other, so that as exam takers move through the questions, any wrong answer against a bad cluster won't affect the rest of the exam. As of right now, if you go take the exam, you'll be taking it in an environment similar to this.

I'm going to have Rye take us through what that looks like in practice. This isn't the actual exam environment; it's similar in construction, but it would be disingenuous to show you the real one. This is just something built with similar technologies and processes. So what do we have here? Right, I've SSH'd into an AWS box. It's an m4.2xlarge, and it's running five Kubernetes clusters; each cluster has about 10 machines, for a total of approximately 50 machines. So let's take a look at lxc list here, which might take a moment to come up because it's a long list. And for the m4.2xlarge, how many cores and how many gigs of RAM is that? It's eight cores and 32 gigs of RAM. And what's the monthly cost on that, not even doing spot instances, at its most expensive? That would be about $300. Okay, so, sorry, how many cores did you say? Eight. Eight cores, $300 a month, one box, five clusters, about 50 LXC machine containers. That's right, yep. So this is lxc list: these are all the machine containers running. You can kind of make out that some of these are the Kubernetes nodes, because you can see the Docker, CNI, and flannel bridges. We can SSH into one of these; the keys are already over there.
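For a sense of how an environment like this might be stood up, conceptually it's just a loop; this is a hypothetical sketch, not the exam's actual tooling, and the naming scheme is invented:

```bash
# Carve one instance into five clusters of ten machine containers
# each, roughly 50 containers in total.
for cluster in 1 2 3 4 5; do
  for node in $(seq 1 10); do
    lxc launch ubuntu:xenial "c${cluster}-node${node}"
  done
done

lxc list   # ~50 entries, each with its own IP address
```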
So let's see and we can take a look at What's running under Docker we should see some kind of Kubernetes loads We can take a look at The pods the nodes cluster status So this is one of the clusters then it looks like there's three Kubernetes nodes in this cluster. It's got a three nodes of Xcd running as well and some workloads some pods I guess running Yeah, that's right cool What's the so What's the resource utilization like on this box? I guess is the next question I would have as a good question. Let's take a look So the load's a bit high right now. I usually see something around Five between five and seven, but we can take a look at each top as well So looks like the 15-minute load is around eight. So eight cores. That's pretty much about a hundred percent utilization. So that's five Kubernetes clusters across 50-ish machines running on a single instance at $300 a month. That's right cool I Imagine most of the bottleneck is not CPU, but actually disk IO or is it something else? Yeah, it is disk IO right now Cool Sweet, so that's that's an example of how we're able to do things like using LXC You can use whatever flavor of tool you like to deploy Kubernetes with Whether it be kube-adm Or the gambit kube-spray CDk etc to deploy a bunch of machines into a single cluster so from a silphrope perspective This was wildly successful for us We were able to set up a bunch of dev and test clusters on a single instance In Google's compute engine which gave us better economics And then we could basically test all of our pipelines with the same production model number of machines number of configurations Etc and then our actual production cluster was the actual where we actually spent the most of our money And this actually worked out pretty well for the Linux foundation as well cool so The last thing I want to talk about which is kind of the next Evolution for us at the silphrode is bare metal and large clusters As you can imagine we have quite a lot of code running for those of you views the site Which is none of you you should check it out if you don't that's fine, too But we have a lot of supporting services things like in a geolocation based game It's most important to help find people that are in similar areas to each other So we have lots of code that runs to help figure out who is close to you to help you achieve Certain tasks in game. We also have a giant world map Which we help display where points of interest are and other things of of importance within the game itself And so because of that we do a lot of traffic and if you've ever tried to embed Google maps or any other map Vendor onto your website. You might find it's pretty okay when you start pushing 30 million plus hits those things it actually becomes really expensive. It turns out maps and base maps are super costly And so we are into a problem where our map vendor was saying we can no longer support you at the free tier You're going to need to pay ten thousand dollars a month for utilization Considering we pull in less than two thousand a month in ad revenue That was not going to be sustainable for very long and I don't feel like paying out of pocket very much for these things So we figured okay, we'll run our own map servers. We use open street maps. There's open-source software do all of these things will take our Take our destiny destiny into our hands and we'll build and run these things ourselves Turns out it's super expensive to generate map tiles. No wonder. 
Sweet. So that's an example of how we're able to do things like this using LXC. You can use whatever flavor of tool you like to deploy Kubernetes, whether it be kubeadm or the gamut of Kubespray, CDK, and so on, to deploy a bunch of machines into a single cluster. From a Silph Road perspective, this was wildly successful for us. We were able to set up a bunch of dev and test clusters on a single instance in Google Compute Engine, which gave us better economics, and then we could test all of our pipelines against the same model as production: the same number of machines, the same configurations, and so on. Our actual production cluster was where we spent most of our money. And this worked out pretty well for the Linux Foundation too. Cool.

The last thing I want to talk about, which is the next evolution for us at The Silph Road, is bare metal and large clusters. As you can imagine, we have quite a lot of code running, for those of you who use the site, which is none of you; you should check it out, and if you don't, that's fine too. We have a lot of supporting services. In a geolocation-based game, the most important thing is helping find people who are in similar areas to each other, so we have lots of code that runs to figure out who is close to you, to help you achieve certain tasks in the game. We also have a giant world map which displays where points of interest are, and other things of importance within the game itself. Because of that we do a lot of traffic, and if you've ever tried to embed Google Maps or any other map vendor on your website, you might find it's pretty okay until you start pushing 30-million-plus hits; then it actually becomes really expensive. It turns out maps and base maps are super costly. So we ran into a problem where our map vendor said: we can no longer support you at the free tier, you're going to need to pay ten thousand dollars a month for your utilization. Considering we pull in less than two thousand a month in ad revenue, that was not going to be sustainable for very long, and I don't feel like paying much out of pocket for these things. So we figured, okay, we'll run our own map servers. We use OpenStreetMap, there's open-source software to do all of these things, and we'll take our destiny into our own hands and build and run this ourselves. It turns out it's super expensive to generate map tiles; no wonder it's so frickin' costly.

So we realized that running this in the cloud was not going to be very cost-effective, and we decided to invest in some bare metal servers. We found a really decent colo and some good secondhand servers: two sockets, about 400 gigs of RAM, and tons of disks to store this data. And then we hit our next roadblock: the 100-pod limit. Does anyone know about this? Has anyone hit it yet in their clusters? It turns out, as we'd have known had we read the documentation, that when it comes to building large clusters there are some limits in the software itself, the chief one being that by default you can only run a hundred pods per kubelet. For those of you running decent-capacity workloads: most average web workloads don't hit that, especially on cloud instances, where you have maybe four to eight cores per machine, or fewer, and less than 16 gigs of RAM; a hundred pods in that space is actually quite dense. But it's different when you start looking at bare metal. This was the average size of our server: a two-socket, almost white-label box with 24 CPUs, 500 gigs of RAM, and about 20 terabytes of disk. If you start dividing that up and saying, okay, I'm going to run on average a hundred pods on this server, that's not going to work very well. It's awesome if you're doing machine learning or render farms or anything that needs tons of compute with very few workloads, but if you're not, it's kind of awful.

What we found is that if this is the model today, one piece of bare metal, one kubelet, a 100-pod limit, then we could leverage the same learning as before; there's a trend in this talk, I'm not sure if you're sensing it. We could use machine containers in LXC to split up a piece of bare metal. Because we're not incurring any virtualization overhead, I don't need to worry about installing anything to manage a VM layer; I just use the hypervisor-style API to say: give me four machine containers. There's no virtio overhead, so I'm not limiting the resources of my bare metal machine, but I'm able to slice it up and say: instead of one kubelet, I can fit four kubelets, each with a 400-pod limit. We went further and started playing with different architectures. In the first model, each machine container was pinned to a CPU socket, so two machine containers per socket with half the RAM divided across them. Then we went further and tried CPU-sharing models, to see how we could slice, dice, and tune the quality of service on these things. We got up to a 900-pod limit; we never hit it, because we didn't have enough stuff to run 900 pods. What we actually ended up with was a mixture of architectures, with one giant LXD machine pinned to one socket and the rest for more general-purpose workloads running alongside it. We were able to utilize about 600 pods per server, which gave us a good sweet spot of density while still leveraging the performance of the underlying hardware. And these are nodes, so this is a cluster: we have about three or four pieces of metal split up in this architecture. We actually have a lot of spare capacity for pods, but to serve all of our map tiles we run, on average, probably 1,200 pods.
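The two knobs that make this slicing work are LXD's per-container resource limits and the kubelet's pod ceiling. A sketch, with the container name, core ranges, and sizes invented to match a 24-core, two-socket box:

```bash
# Pin one machine container to the cores of a single CPU socket
# and give it a fixed slice of the RAM.
lxc config set tiles01 limits.cpu 0-11
lxc config set tiles01 limits.memory 200GB

# Inside that container, raise the kubelet's default 100-pod
# ceiling (shown alone here; in practice it runs with the
# kubelet's usual set of flags).
kubelet --max-pods=400
```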
Yes, that's LXD itself; Kubernetes doesn't see the bare metal, it just sees that you have, in this case, five Kubernetes nodes, which happen to be LXD containers. It's less overhead, both in managing it and in the resources used: the actual residual computational cost of LXD containers comes down to fractions of microseconds of CPU time spent managing the namespaces, whereas with VMs it's actually very heavy. Exactly. Yes, and then there's dynamic resizing: we're able to reallocate the limits on these things, saying, you're suddenly no longer CPU-pinned, you're CPU-shared, or you now have a hard memory limit, all without restarting anything. Yeah, exactly.

We haven't hit any real problems. We do this today in production, and I know of a couple of other customers doing something like this in production as well. We just had to make sure we allowed and whitelisted certain kernel modules into the machine containers, things like the VLAN and IPv6 modules, but outside of that we haven't had problems, and we view this as a safeguard for us. We never touch the bare metal anymore; it's just a very thin, lightweight Ubuntu OS at this point. And we're pretty confident that if we ever got an exploit, because again, it's volunteers giving us code, we do our best on security, but we're not PCI or HIPAA compliant by any means; if a container did break out of confinement in Docker, or whatever CRI we were using, we feel okay knowing it likely won't escape this isolation and affect the bare metal or inject bad firmware or something.

Yeah, we don't use live migration. I've used it in the past, but we don't use it here; we just resize, or delete machines and create new ones. Oh, I don't know, I've not tried; maybe worth a shot and a blog post, though. Cool.

No, these are all connected to one cluster, just spanned across multiple pieces of hardware. From our perspective it's actually just three pieces of metal in a colo; it's not a huge bare metal footprint, but we have a lot of capacity that we use. Inside of Kubernetes we see 15 Kubernetes nodes in our single cluster, and they just happen to be spread across those machines. It's almost as if we'd gone to Amazon and launched a bunch of instances; we're simply using machine containers to get the best resource utilization out of our bare metal. We've got tooling to automate the creation of these things so we can repeat them: we just replay that automation and it stands up all our little pieces and connects them into the cluster with the right configuration, yeah.

We use a combination of our own stuff and open source. It's not super relevant; this will work anywhere, and we've done it with kubeadm. We weren't quite ready for kubeadm in production; there were some HA and upgrade problems, and we wanted to make sure we could always upgrade our Kubernetes, so we used a combination of things like the Canonical Distribution of Kubernetes and some of our own scripts to manage how we do our bits. We basically borrowed from vendors and said: install stuff. Yeah.

Yeah, that's not in my slides, but I'll add it; I had a bunch of links in there, and I'll make sure that link is in there as well, just some of the infrastructure code we have.

Yeah, LXC has a way to modify profiles. Inside of LXC I can say: here's a general profile that gets applied to all machines, or this machine in particular has these vCPU limits, these memory limits, even down to network and disk I/O throttling as well.
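What we're describing is plain LXD profile and container configuration. A common recipe for running Kubernetes inside LXD looks like the following; the profile name is illustrative, and the exact kernel module list depends on your CNI and workloads, so treat it as a starting point:

```bash
# A profile applied to every Kubernetes node container.
lxc profile create k8s-node
lxc profile set k8s-node security.nesting true
lxc profile set k8s-node security.privileged true
lxc profile set k8s-node linux.kernel_modules ip_tables,ip6_tables,netlink_diag,nf_nat,overlay

# Limits can be changed live on a running container, no restart:
lxc config set tiles01 limits.cpu 8        # drop pinning, share 8 cores
lxc config set tiles01 limits.memory 64GB  # hard memory cap
```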
So all of that is just built on the Linux kernel, cgroups and namespaces, basically.

Yeah, it does a lot of this. We had some pretty unique architecture requirements, so we utilized a lot of the tools underneath conjure-up to do it, but we basically built and scripted our own way. conjure-up works great; that's how we did a lot of our proofs of concept, but when we went to production we took the same tooling underneath conjure-up and scripted our own way, basically. Yeah, Juju is one of those tools, and then we bolted on some scripting and processes to handle how we dynamically resize these LXC machines, rather than having that handled in Juju.

So, at the end of the day, sorry: we did actually start with virtual machines initially, and we found the density was quite poor. In fact, we ended up seeing about 10x more density with the same workload on these pieces of bare metal, probably more than 10x, actually. As a result, that's how we ruled out virtual machines, and we still sit firmly in the land of containers, just a varying range of them.

I think that's about it for our talk. Yeah, thank you, thank you all for coming; I really appreciate it. I think we're out of time. Yeah, we're definitely out of time.

Those links I mentioned before: linuxcontainers.org for everything LXC and LXD; cncf.io for the certification exams, you should all get certified, it's a great thing to have, and do study first, it's like a four to six hour exam with lots of questions about Kubernetes from both the administration and usage sides; and the Kubernetes "building large clusters" page. I like to source that one so people know where the numbers come from; they always change them between releases, so check it to make sure you have the latest specs on how big clusters can get. I'll add the other links I mentioned as well. Thank you all, and enjoy the rest of KubeCon!