Can you guys hear me now? Is that better? All right, sweet. Thanks for the introduction. Happy to be here talking about infrastructure and containers.

As most of you already know, containers are becoming the standard way to build, deploy, and manage your infrastructure. They're really easy to get started with: a few command lines and you're up and running. But as you start scaling up and adding more containers, the complexity quickly arises. So this talk focuses on where these complexities come from and the different ways to start handling containers at scale. Unfortunately, this is not going to be the holy grail or one solution that solves all of your problems. Hopefully, though, it gives you the questions to ask yourself as you start scaling up: what are the priorities in your use case, and which of the different options can you choose based on those priorities?

I'll do a quick intro to containers. I think most of you are aware, so I can probably breeze through that, go through container clusters, go through some of the complexities as you start scaling up, and then go straight into the different options and the pros and cons of each. Hopefully I'll have some time for questions at the end; if not, come talk to me afterwards and we can go over things there.

So, a container is your runtime environment containing your application and all of its dependencies. The main point here is that the container engine, most likely Docker, sits on top of both the operating system and the infrastructure. This is great: your containers are abstracted away from all the underlying resources. But in some cases you still have to interact with the infrastructure. If your containers need access to specific resources, such as GPUs, you'll have to deal with it directly. And as you scale up and add more pieces of infrastructure, those machines have to communicate with each other to let the containers network, so there's extra complexity there.

If one container is great, multiple containers are even better. As you start scaling up, different use cases emerge. To handle higher concurrency and higher throughput, you'll want some sort of load-balanced application. The load balancer needs to be aware of the other containers: what ports, what IP addresses, and so on. So there has to be some sharing of information. And with microservices, you'll have lots of different containers performing different actions, but you want to limit the policies and privileges each of them has. The UI should be accessible by everyone, whereas the database you really want to lock down. So now you have multiple containers and different pieces of infrastructure with different permissions, different routing rules, and so on.

Thinking just locally first: say you want a simple load-balanced application, maybe two Flask apps. You'll have to put them on different ports, since you can't have multiple applications on the same port, and then put some sort of load balancer such as HAProxy in front.
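As a minimal sketch of that static setup, assuming a hypothetical `myapp` image that listens on port 5000 (`--network host` is a Linux-only shortcut so HAProxy can reach the published ports on localhost):

```bash
# Two copies of the same app, published on different host ports,
# since they can't both claim the same port.
docker run -d --name app1 -p 5001:5000 myapp
docker run -d --name app2 -p 5002:5000 myapp

# A static HAProxy config: the backend addresses and ports have to
# be known ahead of time, which is exactly the limitation discussed next.
cat > haproxy.cfg <<'EOF'
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend http-in
    bind *:80
    default_backend apps

backend apps
    balance roundrobin
    server app1 127.0.0.1:5001 check
    server app2 127.0.0.1:5002 check
EOF

docker run -d --name lb --network host \
  -v "$PWD/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro" haproxy
```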
However, the load balancer starts from a config file, and you have to know the addresses and ports of your applications or containers ahead of time. This is not very dynamic: you can spin things up for this one case, but you want to be able to dynamically scale up and down across infrastructure as well. To do this, you have to be able to discover when new containers join the cluster. So you need some sort of service that's monitoring, watching for specific tags or specific ports; some way to register the service and health-check it; and then a way to update the load balancer with the new container ports, the new IP addresses, and the rules for what you want routed to that container. You can do this with tools such as Registrator from Glider Labs; Consul can act as the service registry; and then you need some way to reload the load balancer with the new config. You want to do that without interrupting traffic, so you might need multiple load balancers as well.
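To make that wiring concrete, here's a rough sketch of one common combination: Consul as the registry, Registrator watching the Docker socket, and consul-template regenerating the HAProxy config on changes (the service and image names are hypothetical):

```bash
# Consul as the service registry (single dev-mode node, for illustration).
docker run -d --name consul --net host hashicorp/consul agent -dev -client=0.0.0.0

# Registrator watches the Docker socket and registers/deregisters
# containers in Consul as they start and stop.
docker run -d --name registrator --net host \
  -v /var/run/docker.sock:/tmp/docker.sock \
  gliderlabs/registrator consul://localhost:8500

# New containers now self-register; -P publishes their ports dynamically.
docker run -d -P -e SERVICE_NAME=myapp myapp

# consul-template re-renders the HAProxy backend list whenever the
# service membership changes, then triggers a reload. The template
# (haproxy.ctmpl) would contain something like:
#   {{ range service "myapp" }}
#   server {{ .Node }}_{{ .Port }} {{ .Address }}:{{ .Port }} check
#   {{ end }}
consul-template -template "haproxy.ctmpl:haproxy.cfg:systemctl reload haproxy"
```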
The point I'm trying to make is that simply load balancing two containers is not that simple. There are a lot of extra pieces you're already incorporating into your system, and now you have to monitor and update those services too. The complexity is already increasing just from trying to load balance a couple of containers.

That brings me to the next point: what makes scaling so hard? As you saw, there's the networking: dynamically changing the load balancer's routes and balancing across multiple containers. Then there's the infrastructure side. You want to utilize more and more resources for your containers, so you need to provision infrastructure on demand. The machines have to communicate, so they have to be aware of each other. You also want to keep these clusters balanced, so you need some monitoring service to understand how many resources each piece of infrastructure is using, and that's another service that needs access to all the infrastructure. Then there are other fun things such as updating: how do you update your cluster as often as possible, automatically? And even scaling down: you've worked so hard to finally balance your cluster, but if you're not utilizing all the resources, you want to scale down. So you have to unbalance it, move containers, and make sure you're not interrupting user sessions or any privileged jobs before scaling down. You have to manage the balancing and unbalancing dynamically.

A lot of this you can do manually fairly easily through the command line, shell scripts, or SSH'ing into machines. The hard part is making it as automatic and dynamic as possible, reducing the need for ops so you can focus on other things, because the reason you're using containers and scaling up is to focus on your product and your company, not on infrastructure and provisioning. And then obviously there's the monitoring that comes into play as well.

So now you have a pretty complex distributed system, with many containers across many different pieces of infrastructure. That infrastructure can span regions to increase resiliency, but that also makes things a lot more complex if you're doing failovers and that sort of thing. Now you have distributed logs, which you probably want in a centralized place to analyze, so you can figure out where failures are happening. If you're doing A/B testing or rolling updates, you want to be able to figure out exactly which container, which version, and what the problems are, and diagnose them as fast as possible.

So now let's go into the various options. I'm focusing mostly on the cloud, but a lot of the concepts also apply to on-prem, multi-cloud, or hybrid on-prem/cloud solutions; maybe the provisioning itself and how you manage the actual resources will be different. First, there's plain Docker or Docker Compose: the naive way of doing it. When you spin up your compute, you run the same commands you ran locally: install your container engine, Docker, and run the same Docker scripts. This is very simple to start with. You don't need an orchestrator or any extra software. But it's not going to be very scalable, since all your containers scale up the same way. It relies on the actual compute instances to scale; you're not scaling containers. So it's going to be a lot more expensive and a lot slower. This is good for testing, good for seeing how your containers behave on cloud infrastructure and for doing some basic scaling and basic networking, but it won't scale well. If you have multiple microservices on the same compute instance and one of them fails, there's no way to automatically or dynamically fix the problem. You might have to SSH into the instance and reboot the container, or terminate the whole instance and spin up all new microservices, which is going to be expensive and add latency. So again, this is the naive version: the easiest to spin up, but it won't scale well. Also, don't just put one microservice or one container on each compute instance; the main purpose of containers is flexibility, and that's exactly what you'd be giving up, so just don't do that.
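For reference, the naive pattern boils down to a per-VM startup script along these lines (a sketch; the image names are placeholders):

```bash
#!/bin/bash
# Hypothetical VM user-data script: every instance installs Docker and
# runs the exact same containers, so scaling means scaling whole VMs.
set -e
curl -fsSL https://get.docker.com | sh                # install the container engine
docker run -d --restart=always -p 80:5000 myapp       # placeholder app image
docker run -d --restart=always -p 8080:8080 myworker  # placeholder second service
```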
And this is where container orchestrators come in: Kubernetes, Docker Swarm, and HashiCorp's Nomad. These address a lot of the problems you saw earlier: service discovery, load balancing, and all the networking. And they all work roughly the same way. They have pretty similar architectures; the underlying tech might differ a little and some abstract away more pieces, but essentially the architecture is the same. There are master nodes, worker nodes, and then the application service layer. The master nodes and worker nodes are separate pieces of infrastructure, and the container service layer is your good old Docker containers.

The main takeaway is the master nodes. They are the heart and brain of the cluster: if they fail, your entire cluster fails. This is where a lot of the routing logic lives, where all the service discovery takes place, and where all the state of your cluster is held. The worker nodes are where your containers are actually deployed. They're the ones that scale up and down, and they need additional software: to connect to the master nodes, and to do service discovery so the masters know what containers are running on each instance, what ports those containers are on, and the health of those containers, so the masters can keep all of that up to date.

Okay, problem solved? Maybe. The problem is that this is a lot harder to actually do than it sounds, and a lot of the reason is that the orchestrator layer is abstracted from the infrastructure. As you saw earlier, the container engine sits above the operating system and infrastructure, which is on purpose. But it means it's on you to provision the infrastructure, install the software, manage all the updates, manage scaling up and down, and handle all the networking. You can actually install Kubernetes on a cluster of Raspberry Pis, and it works just as well as any cloud or on-prem solution. But there are a lot of pieces involved.

The hardest part is probably the master nodes. As I mentioned, they're really the heart and brain of the cluster. Generally you want at least three of them, or five: some odd number, since they work on a consensus basis. As a distributed cluster, the majority overrules the minority. If one master node is disconnected from the other two and rejoins with outdated data, the majority fixes whatever is out of sync, and the whole cluster converges. The problem is that if you lose the majority of your masters, you lose the state of your cluster. You can lose track of where your containers are and of the IP addresses of everything else in your cluster, which requires a lot of manual fixes: you'll have to manually update these masters. So failover is one of the more difficult parts to manage, along with updates and those types of problems.

As for the minimum of three: if you have one master and it goes down, you lose the entire state of your cluster. So three is generally a good number. The higher you go, the harder it is to manage and the more expensive it gets, but you get more resiliency. It depends on how much resiliency you want versus how you manage costs. And then the hardest part is updating. You can't just spin the masters down and spin them back up with the newest updates; you have to manage it slowly. You probably spin up a new master, have it sync with the rest of the cluster before terminating an outdated one, and then repeat that process. If you make any mistake, the cluster will be out of sync, and you'll have to manually go in, reconfigure, figure out the state of the cluster, and update the masters with the correct state. There are lots of ways to do this, and lots of papers on it; the hard part is doing it dynamically, as quickly as possible, with as little intervention as possible, without needing a whole team to manage it, while always monitoring and keeping track of its status.
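Kubernetes, for instance, keeps that cluster state in etcd on the masters, and the one-at-a-time replacement dance looks roughly like this (etcdctl v3 commands; the member name, URL, and ID here are hypothetical):

```bash
# On a healthy master: confirm quorum before touching anything.
etcdctl endpoint health --cluster
etcdctl member list

# Add the replacement member first, then start etcd on the new node
# (with --initial-cluster-state=existing) so it syncs from the majority.
etcdctl member add master-4 --peer-urls=https://10.0.0.14:2380

# Only once the new member is in sync, remove the outdated one
# (the ID comes from `member list`). Repeat one node at a time:
# removing two at once from a three-node cluster loses quorum,
# and the cluster state with it.
etcdctl member remove 8e9e05c52164694d
```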
And this is where the managed Kubernetes services come in. These come from the major cloud providers: Amazon's EKS, Google's GKE, and Microsoft's AKS. There are subtle differences between them. Amazon manages only the masters for you, so you're still managing all of the worker nodes, the installation, and connecting them to the masters, and I think there's an extra fee for using it. They take away the hardest part, but there's still a lot of scaling work: you still have to install all the software and update the worker nodes. But they handle the balancing and all of the master side.

Google's and Microsoft's versions actually manage all the infrastructure for you; you manage the scaling. They manage all the masters, so you never have to worry about updating those or anything of that sort. You just tell it how many worker nodes you want and at what sizes, essentially what compute you want, and they provision it for you, connect it to the masters, update the software, and handle all those pieces. The nice thing is there's no extra cost: you're still paying for the underlying infrastructure as you normally would if you were using just the basic resources underneath. So it takes away a lot of the problems of provisioning, installing software, and wiring things together, which is nice.

But obviously there are drawbacks. You're locked into one of those vendors, so you might not be as flexible; if you also still need to be on-prem as well as in the cloud, that's a little harder to navigate. And if you need third-party add-ons or custom solutions, they're not going to be as flexible. If you have extra security or monitoring needs, those can be a little harder to integrate.

Quickly, with Google's managed Kubernetes service: with the click of a button, you create the cluster, and behind the scenes they manage all the master nodes. As those dynamically spin up, they share their IP addresses with each other and join themselves into a cluster. Once they're available, however many worker nodes you want are added directly to the cluster behind the scenes, without you having to worry about it. Deploying your containers and creating the service is pretty similar to Docker Compose or whatever container framework you use to deploy: generally a YAML file where you describe how you want it to run and how many replicas you need, and then all the routing and scaling is done by the masters, which in this case are managed by Google, or by whichever cloud provider.
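As a sketch of that flow on GKE (the cluster name, project, and image are hypothetical):

```bash
# Google provisions and manages the masters; you only pick the workers.
gcloud container clusters create demo-cluster --num-nodes=3
gcloud container clusters get-credentials demo-cluster

# The deployment is a YAML description of desired state; the managed
# masters handle the scheduling, routing, and scaling.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
      - name: myapp
        image: gcr.io/my-project/myapp:1.0   # placeholder image
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer
  selector: {app: myapp}
  ports:
  - port: 80
    targetPort: 5000
EOF
```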
Then there are also the semi-managed container services. This is a bit more of the legacy stuff from Amazon: Amazon's Elastic Container Service. I think Microsoft might also have one. What you get there is that they manage a lot of the container scaling and a lot of the load balancing, but you're still managing your compute. If you need to do anything more advanced, share information between containers, or use special configurations, you'll have to do all of that manually and manage it yourself. So this is useful if you have a simple API or a couple of containers and you just don't want to worry about the scaling or the basics: you want to be hands-off. You can have some auto-scaling infrastructure that installs the basic software, point it at whatever load balancer Amazon provides, and it will scale out from there.

So, in conclusion, I want to briefly go over the different things we talked about. Docker Compose is the naive way of going about it. It's great for basic testing, for switching between environments before you even deploy to your dev environment; you can just set it up quickly and easily.

Next are the container orchestrators. These are the most complex: you're managing everything from the infrastructure to the container deployment layer. The hardest thing is making the infrastructure dynamic, with as little ops work as possible. There are tons of tutorials out there on how to set up your Kubernetes cluster, or whatever cluster you're running, but being able to update dynamically and scale up and down dynamically is where a lot of the difficulty actually arises. Failovers and other kinds of disaster recovery are especially difficult, because as soon as you're out of sync you're in trouble. It also depends on whatever provisioning tools you're using, things like Terraform and Ansible: they're not great at letting machines dynamically discover each other, so you'll have to add additional software and tools, and use things like tags and metadata, for nodes to actually join clusters.

That's where the managed container orchestrator services have come from, and these are pretty recent; a lot of the bigger announcements have been this year. They manage all the difficult parts of the container orchestrator layer for you, plus a lot of the infrastructure provisioning and scaling, letting you focus on just deploying your containers and specifying how many you want. It's also much easier to automate: you can have a simple monitoring system watching the usage of your cluster, and past a certain threshold, call an API to start scaling the cluster up; behind the scenes, whatever cloud provider you're using manages that for you. But again, you're tied down a bit to a specific cloud provider. They do allow you to extend the cluster so the masters can connect outside of that cloud provider, so if you want some sort of on-prem plus cloud provisioning, you can do that. It will be a lot more work than simply sticking with the cloud, but there are ways to do it; your team will have to figure that out.
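The monitor-then-call-an-API loop mentioned above can be as crude as a scheduled script like this hypothetical sketch (GKE-flavored; the cluster and pool names are placeholders, and in practice you'd more likely reach for the cluster autoscaler):

```bash
#!/bin/bash
# Hypothetical threshold scaler: if average node CPU utilization
# reported by `kubectl top nodes` exceeds 80%, add one node.
CLUSTER=demo-cluster   # placeholder names
POOL=default-pool

CURRENT=$(gcloud container clusters describe "$CLUSTER" \
            --format='value(currentNodeCount)')
USAGE=$(kubectl top nodes --no-headers \
          | awk '{gsub(/%/, "", $3); sum += $3; n++} END {print int(sum / n)}')

if [ "$USAGE" -gt 80 ]; then
  # The cloud provider handles the actual provisioning behind the scenes.
  gcloud container clusters resize "$CLUSTER" \
    --node-pool "$POOL" --num-nodes "$((CURRENT + 1))" --quiet
fi
```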
And then there are the semi-managed container orchestrators, where your options are more limited. Some of the time you still have to manage the compute and the provisioning, and install software to connect to the cloud provider's service. This is good if you have simple applications that you just want to auto-scale, where you don't want to worry and just want to know it will handle the concurrency of your users as demand scales up and down.

I guess the main thing to think about is how much control you want over your clusters, especially over your infrastructure provisioning. It depends on what your experience is, how big a team you have, how much you actually need to configure, and how much you want to work on maintaining and building these systems. The more you scale up, the more moving pieces there are, and the more work it is when there's some failover or something that requires manual intervention. I think there was a talk yesterday that said things in the cloud always fail. You can't just expect it to always run and think, "look, it's perfect, I can spin it up, it all works." As soon as you hit your first failure or first update, a lot of the automation and the dynamic stuff you provisioned can fail, and it takes a lot of work to get back in sync. So again, it depends on how much engineering time you want to dedicate versus how much unique capability you need.

The final takeaway is that this field is changing very fast. A lot of this stuff is from just the last one or two years, so a lot of what I've talked about may be obsolete next year. Always be aware of what's out there; there's a lot of extra tooling and a lot of people working on the problems that make this so difficult. Keep reading and stay up to date. And with that, I'll open it up to questions. I'm not sure how much time there is. Yeah, you're good. If there are any questions afterwards, my email's here; you can send me email, send me feedback on the talk, or anything else. Thanks for coming.

[Audience] I was just wondering if you had any experience using kops or any of the other open-source Kubernetes cluster management tools, versus the Amazon or Google offerings?

Yeah, so I've used kops a little bit, and I find those tools are generally a great place to get started. Again, with a lot of the dynamic behavior, when failovers happen or when something fails, maybe DNS or some outage in a certain region, that's generally where those tools are lacking and it's harder to recover. But for starting off and actually getting everything provisioned, it's a lot faster. So if you're trying to evaluate a POC, asking whether a container orchestration layer is something you need, it's definitely a great place to start, versus going through the 50 steps to set up your Kubernetes cluster from scratch: everything from installing the binaries, to working on the networking, to creating and distributing the certs and a lot of the security. That's a good point; I probably should have mentioned it, but I think it's a good middle ground between a cloud provider and doing it by hand. Thanks for bringing that up.
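(For reference, bringing a cluster up with kops is roughly the following; a sketch in which the cluster name and S3 state bucket are hypothetical.)

```bash
# kops generates the certs, networking, and node configuration that the
# "from scratch" route makes you assemble by hand.
export KOPS_STATE_STORE=s3://my-kops-state-bucket   # placeholder bucket
kops create cluster --name=demo.k8s.local \
  --zones=us-east-1a --node-count=3 --yes
kops validate cluster
```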
[Audience] Hi, I use kops to orchestrate my Kubernetes clusters. What would be a good way to set up DR if you have multiple Kubernetes clusters, and to back up all of your configs and kops configs?

Yeah, so the hardest thing about DR, and you're thinking cross-region here, is that when there's a disaster, it's most likely your master nodes getting out of sync or going completely down. So you need some way to either back up, or even have a whole other cluster, at least of the master nodes, ready to spin up, with some way to share the state of the cluster. I think a lot of places simply have two complete clusters, or at least two full sets of master nodes, running in parallel, and can then switch from one region to another. But yeah, that's going to be one of the hardest things to manage. I think a lot of the cloud providers actually keep multiple clusters up and running behind the scenes if you select some sort of DR option. But you're going to have to run at least some things in parallel.

[Audience] So can you use etcd to bring back your cluster if, say, your master goes down or everything gets corrupted?

Yeah, behind the scenes Kubernetes actually does use etcd. So if you want to automate it, you can write scripts that use etcd to figure out the current state. A lot of the time it requires manual work, but if you script against etcd you can automate that as well. That's another good option, and probably the place to start if you want to try to automate disaster recovery.

[Audience] I'm just wondering whether you have any experience with or thoughts on OpenShift? Specifically, how OpenShift took a secure-by-default stance, which is in pretty stark contrast to the Kubernetes defaults?

Yeah, unfortunately I don't actually have much experience with that. But I know that's one of the problems with Kubernetes: a lot of the security and the policies are pretty open by default, so it requires a lot of third-party applications and a lot of additional configuration. That's why a lot of people configure it from scratch and then have to patch a lot of things on top. I think they know it's a problem, so they'll probably address it fairly soon. I mean, that's the thing: everything moves so fast. It depends on how long you can wait and what your specific needs are, because a lot of the time everyone is playing catch-up: they realize there's something out there that everyone is moving to a tool for, specifically for that reason, and as soon as they bring that over, the picture changes again. But sorry, I don't know too much about OpenShift yet.

[Audience] Okay, thank you very much.

Thank you.