Hello, everybody. Thanks for coming. We can get started, I guess. So my name is Ron Lipke. I'm a senior engineer, cloud type person, on the platform as a service team at Gannett. And I'm going to talk to you a little bit about our container journey. Full disclosure, this is my first time ever speaking at a conference. So in addition to dying a little inside right now, I'm also a person who stutters. Or if you're from the UK, it's stammer. So if it looks like I'm rebooting up here or stalling, that's probably why. So some of you might be wondering just who or what Gannett is. We are a national news and media company with roots in the traditional print business. We're probably most recognized by our national brand, USA Today. But we also have about 130 other local news outlets in the United States that make up the USA Today Network. And that brings in about 125 million unique visitors a month. These are just all the brands in the USA Today Network. Maybe you grew up around one, or you recognize a logo from where you live now. So what does a platform as a service do for Gannett? We provide the central location for self-service provisioning and tooling of infrastructure for about 40 internal dev teams across Gannett. That's where they can test and deploy their apps, all the way up to production workloads. And this is all in the public cloud. We do have some data centers, but everything on the platform is in the public cloud. So for us, not only are we managing the environment that runs all of USA Today's apps, but our customers' dev is our prod from our point of view. We have five teams on PaaS. I'm on the integration team, and we are responsible for architecting and maintaining the core features of the platform. So the start of our container journey goes all the way back into the deep annals of history: the 2016 United States presidential election. Basically, in the news industry there are two types of events that you have to plan for. There's breaking news, when everyone's phone is going off, which ends up in a very unpredictable yet high volume of traffic that dissipates pretty quickly. And really your defense against that is: make sure all the things are scalable, and if they're not, run them over-provisioned, which isn't the best; and cache all the things, cache in front, cache in back, protect your origin at all costs. And you should be able to handle all those news spikes that seem to be happening much more frequently. Then there's events like the Super Bowl, the Olympics, the presidential election, which have a sustained high volume of traffic. And the data footprint is very different, because your users want the most accurate, most up-to-date information possible. So you can't protect your origin behind those really high cache TTLs. And the content and the product teams are probably going to be requesting a lot of deployments to change how the data's being presented, fix any bugs, or anything like that. So the faster and safer we can do those deployments, the more satisfied our internal and external customers are. So this was shaping up to be the largest night of traffic in USA Today history, and we figured, why not make that the first time we run a container app? Up to this point, we had not run a container on the platform. So it was new for us. It was something that we had been considering, but really hadn't planned out yet.
And then our core product team came to us and said, hey, we're interested in running the elections app in a container. We said, sure. And we scoped it as kind of a stretch goal: if we could run maybe 5% or 10% of election night traffic in a container, we would call that a success. So we started out by creating a list of our requirements. This is just a short list of those. We are an old enterprise. We have data centers all across the country. So we have a lot of legacy infrastructure, and we needed to be aware that there were dependencies here that we couldn't engineer out of to make everything a perfect cloud native app. We really had to take advantage of our existing features on the platform to bootstrap these clusters. We had limited time to even do this, so we couldn't learn something new, like stand up a Prometheus cluster or something. We had to use what we were already using on the platform. And we're a Chef shop, so we had to use Chef to bootstrap these things. Auto scaling was non-negotiable for nodes and containers because of the breaking news thing. And we really didn't have time to write our own, so it would be better if this was inherent in whatever we chose to use. And pretty much everything on the platform is self-service, so these container clusters had to be in the same vein. They had to be self-service, with minimal manual steps required for a team to get started. And we really needed to maintain our cost boundaries and ownership on these clusters. So that kind of eliminated multi-tenancy. And at the time, federation really wasn't baked quite well yet. So we had to find a middle ground in there somewhere. And then, of course, we wanted to quickly iterate with the community, as they're moving quite fast, and just keep up with requests from our users as they started to adopt containers more. So we started out, and we took a whole bunch of sprint tasks, we gave them out to team members, and we said, hey, go take one of these things and POC it. Maybe speak with their sales team or solutions engineering team, and then come back, demo it, and advocate for its use. All of these are great products. There were things that, in and of themselves, were great. But we kept finding that as we got a little bit down in the development cycle with one, we would hit a deal breaker or just something we couldn't engineer around in the time that we had. But this is KubeCon, so it's no surprise that the winner was Kubernetes. It would be really awkward if it was, like, ECS. So a good part of the reason why we chose it is something I like to call the Kelsey Hightower effect. We all know who he is. If you don't know who he is, I highly recommend checking out his Twitter and his GitHub accounts. Our Google team put us in front of Kelsey after we just kept bombarding them with questions, and they were like, someone else needs to answer these. And after that meeting, we came out really confident that Kube was the choice that was gonna work the best for us. So just to talk about some of the requirements I mentioned and how they informed our choice of Kubernetes: we probably spent the most time on networking. And I wanna give a shout out to my teammate Dane, who's running this now for our team. He was a network engineer in a former life, so his expertise here was super, super invaluable.
And Kube has some pretty gnarly network requirements, where everything has to be able to see everything else without the magic of NAT. And since we're running these in a public cloud, we really weren't comfortable with delegating to a cluster the ability to edit our network routes in an AWS VPC or in GCE. Plus there are inherent limits; you can only have about 50 routes. So we went with an overlay network, which is pretty common. We looked at Weave, we looked at Flannel, we looked at Calico. Weave we eliminated pretty early. Flannel, to be honest, we just didn't have time for; it was planned that we were gonna look at it, but we just ran out of time. And that was okay, because it turned out that Calico was a really good fit for us. Its policy management was super helpful with our app deployment strategy, which I'll get to in a slide pretty soon. IP-in-IP encapsulation meant we could just drop it into an already existing VPC. We didn't have to stand up anything extra or reconfigure everything. And AWS was probably happy, since we're not trying to set up BGP routes between our clusters and their routers. And yeah, we've been doing really well with picking Calico. So we use Chef; it doesn't matter if you use Puppet, Ansible, a bunch of gross shell scripts. This is what Scalr looks like to an end user on the platform. It has a really great UI. It also has a really extensive API. And it handles all of the creating and managing of our cloud resources, and supports really extensive governance policy and role-based access, so we can empower our users, but we can also fine tune what they have access to. And Scalr runs on a farm paradigm. So one Kube cluster would be a farm, and then within that farm, on the left, you have your farm roles. And they map to the Kube master, the workers, etcd, your API load balancer. And each of those farm roles has its own separate settings for instance size, auto-scaling settings, orchestration scripts, and this really worked well for how we were envisioning our clusters. And then, kind of more on the side of non-technical challenges, we really had to architect our Kube clusters knowing that they'd be owned and provisioned by just a single team. So that means we kind of had to automate and abstract all of the little things that an app team just probably doesn't want to deal with, like making sure etcd is backed up, setting our namespacing standards up front (because we were gonna have a lot of teams with a lot of clusters, and that was gonna get really annoying soon), and automating all of the creation of the secrets needed to stand up a Kube cluster. But at the same time, we needed to keep in mind that even though these teams have been on the platform for a while, this is probably the most complex thing they're gonna be opsing. And at Gannett, we promote a culture of shared ownership, shared responsibility. So app teams are getting alerted on their infrastructure. Like, if their worker nodes are getting crushed on CPU, then that alert goes to them first. So we really had to make sure that our documentation was good. We really wanted to incentivize them to care. And part of that was also eating our own dog food, so to speak, so that we were using Kube on our team and running our apps on it, and kind of getting a perspective of what their user experience looks like. So that takes us to what our current Kube architecture looks like. And this is pretty much what happens any time a team comes to us and says, hey, PaaS, I wanna do some containers.
And we say, hold up, start here. And everyone has to complete our PaaS labs, as we call them. They basically run through all the features of the platform, what it takes to start using them, and how they can make your team successful. The Kube lab is our longest one. It's lab 27, and it walks through the provisioning of an entire Kube cluster in Scalr, from the creation of the farm and the farm roles, to launching the farm, and then deploying your first app. Once they do that, the second step is a requirement for every Kube cluster that gets stood up after that. They submit a Jira ticket to us that kicks off a build job that builds all of the required secrets and tokens, like the kubectl token, the kubelet bootstrap token, a token for their team, their Consul ACL policy, the API server certs, and stuff like that. And then it puts it in our secrets backend, and the team gets supplied with a namespace of where to find all those things. Then they take that, they go, they build their farm. If they're a new user or a new team, they'll probably be doing that in the UI, which is lots of pointing and clicking. If it's a more advanced team, it's definitely being done automated in their CI pipeline. This is a really bad drawing of what our Kube setup kind of looks like right now. Some things to note here: we are masochists, and we basically roll our own Kube RPMs using the Google binaries. We're not running them as containers themselves. And that's for everything in the control plane, and then the kubelet and kube-proxy on the workers. We run etcd in its own cluster. It's usually a three-node quorum, but for some of our bigger clusters, we go up to five. And then we put HAProxy, which is heavily orchestrated, in front of all the worker nodes, and that gets updated with where to find the API server, and anything that needs to talk to the Kube API goes through that HAProxy. So we also have this really awesome API team on PaaS. They build all of these great tools in Go. And they had built a deployment API that was initially used for deploying apps on cloud instances and managing that entire lifecycle. So they built some new functionality into it to handle container deployments on a Kube cluster. Basically, it's managing three key things. It takes a Kube Deployment object, so anything you could do in a Kube Deployment you could do in our API. It just takes that, deserializes it, and sends it to the Kube API. It creates a Service to go with that Deployment, and it picks a node port from the list of node ports assigned to that cluster. Then it goes and updates the HAProxy config with the IP addresses of all of the worker nodes in the farm and the node ports where you can find the currently deployed version of your app and the new version. Then the user can use the API and start shifting traffic however they want and however they are comfortable with. They can also do a canary deployment. They can just shift a little bit of traffic and then do some automated testing with a pass-fail mechanism in it. Once they're okay with the deployment and they say it's good to go, they send a complete message to the complete endpoint, and the API will go in and delete the old deployment and services and update the HAProxy config again. Or, if it didn't go well, they can ask for a rollback, and it'll just put everything back to the way it was before you started the deployment.
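Just to make that traffic-shifting mechanic concrete, here's a rough sketch, in Python, of the kind of HAProxy backend such a deployment API might render during a canary. The app name, NodePorts, worker IPs, and the use of HAProxy server weights are all illustrative assumptions on my part; the real API and config live inside our platform and aren't shown here.

# Hypothetical sketch of rendering an HAProxy backend that splits traffic
# between the old and new NodePort services for a canary deployment.
# Names, ports, and IPs are illustrative only.

WORKERS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]  # worker node IPs in the farm

def render_backend(app, workers, old_nodeport, new_nodeport, canary_pct):
    """Render an HAProxy backend that weights traffic between the currently
    deployed version (old_nodeport) and the canary (new_nodeport)."""
    lines = [f"backend {app}", "    balance roundrobin"]
    for i, ip in enumerate(workers):
        # every worker gets two server entries: current version and canary
        lines.append(
            f"    server old-{i} {ip}:{old_nodeport} weight {100 - canary_pct} check"
        )
        lines.append(
            f"    server new-{i} {ip}:{new_nodeport} weight {canary_pct} check"
        )
    return "\n".join(lines)

if __name__ == "__main__":
    # shift 10% of traffic to the newly deployed version
    print(render_backend("elections", WORKERS, 30100, 30101, 10))

In this sketch, sending the complete message would amount to dropping the old server lines (and deleting the old Deployment and Service), and a rollback would put them back, which mirrors the flow described above.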
And this is really where Calico shines for us, because the HAProxy has no idea which node is running a pod for the app you are looking for. So when it sends traffic to a node that doesn't have that pod on it, Calico will come in, take it, and send it to a node that does. So Kube is really good at abstracting all these things from people who just wanna run container workloads. But I'd say that benefit comes with a lot of added complexity from an operations standpoint. There's a valid argument about whether or not taking on that added complexity is worth it, at the point where we're at, just to run some containers, but it's the only way right now on the PaaS platform that you can run containers. So we've done the best we can to build up a tool set to handle those issues when they come up and deal with that complexity. One of those things is we run Serverspec in our CI pipeline against a fully running Kube stack. We're checking to make sure that the system is configured right, and we're also checking actual functionality. We're running commands, checking the output, making sure everything in the kube-system namespace isn't just saying it's running, but is actually also working. Then we run some canary clusters. So any changes that we make, or new releases, or if we update things in our cookbooks, we run them against a cluster that actually has apps running in it to validate those changes and check for any regressions. It's not perfect at catching everything, because we can't account for all of the use cases that our users are inventing in these clusters, but it does a pretty good job. We maintain a set of runbooks. If you're not doing this, I'd suggest it, and not just for the new person's first night on call. It's really good just as a reference for the team. We have everything in there, from common troubleshooting techniques to links to documentation. They are continuously updated: any time we encounter something that's just weird, it goes in there. They're all in a GitHub repo, and then we link those to all of our New Relic alerts and our VictorOps paging service alerts. And then these last two things are kind of just standard ops things. I think in the keynote yesterday morning we heard the term observability engineering; I kind of like that term. And a Kube cluster, which is different from our normal stuff, is very dynamic and volatile; things are being spun up and torn down all the time. So we ship all of our Kube logs and our container logs off to our logging vendor. This is also where it helps to have some good namespacing and patterns up front. And then we've designed and created some pretty nifty dashboards, which our success team has even automated as part of the provisioning step when you're building a new Kube cluster. This is just an example of what one looks like: standard worker stuff, CPU, disk, memory, and then checks that all the important services on a worker are still there, like Docker and kube-proxy and the kubelet. And then we look at things like how many of a container, and which version of that container, are on all the workers. We look at the container distribution, which gives us a little bit of insight into how the bin packing is working, and then just some random IO stuff over there.
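To give a flavor of the "actually working, not just saying it's running" kind of check, here is a minimal sketch of that idea in Python. Our real tests are Serverspec in the CI pipeline; this hypothetical script just shells out to kubectl and fails if anything in kube-system isn't Running with all containers ready.

# Minimal sketch of a functional health check for the kube-system namespace.
# Illustrative only: don't trust the pod phase alone, also confirm that the
# containers inside each pod report ready.
import json
import subprocess
import sys

def kube_system_problems():
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-n", "kube-system", "-o", "json"]
    )
    pods = json.loads(out)["items"]
    failures = []
    for pod in pods:
        name = pod["metadata"]["name"]
        phase = pod["status"].get("phase")
        ready = all(
            cs.get("ready", False)
            for cs in pod["status"].get("containerStatuses", [])
        )
        if phase != "Running" or not ready:
            failures.append(f"{name}: phase={phase} ready={ready}")
    return failures

if __name__ == "__main__":
    problems = kube_system_problems()
    if problems:
        print("kube-system is unhealthy:\n" + "\n".join(problems))
        sys.exit(1)
    print("kube-system looks healthy")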
So we did run into some challenges, and one of them was that we had problems with nodes terminating cleanly: not only draining pods, but also removing themselves from Calico, so we didn't have routes to nodes that weren't there anymore. We handle this by creating an orchestration script that Scalr will trigger before a worker node is removed and terminated in its cloud provider. And all that's really doing is hitting the Kube API for its own node name. Then it drains itself, and I think we have 120 seconds there. Then it will delete itself, stop calico-node on that worker, and delete itself from Calico, and that runs every time. Unless the thing just dies, and then you have to go in and manually do that. We had some trouble with auto scaling. There are inherent ways to do this in Kubernetes, or you can just use your cloud provider. You know, if you're in AWS you're using auto scaling groups; in GCE you're using managed instance groups. But chances are that you probably want to do things after the node is there, post config management, post boot steps. So for us, we didn't want to use user data, or user scripts through metadata in GCE. So we created our own custom scaling scripts in Scalr, which Scalr will run on a worker node at a set interval, and then based on the output of that script it'll make a scaling decision. Initially, on our smaller clusters, we were taking the average of those metrics. Then when we got to the larger clusters, we started seeing that a few worker nodes were just getting crushed even though everything else was fine. So we rewrote those scripts to scale off of an individual worker's resource consumption, and then we made both available to our users, so whichever works better in their cluster, they can choose. We hit some issues with conntrack limits. This was kind of a race condition in our testing, because we were disabling firewalld in Chef and also in our pre-baked images. But Calico was running its own firewall, pretty much, and it was also reloading the netfilter conntrack kernel module. So it was putting back into play those sysctl conntrack limits, and we started seeing random nodes in our larger clusters just dropping TCP connections. So we just bumped all those limits and added some pretty detailed alerting and monitoring to our dashboards. Cloud parity: as much as Kube wants to abstract how resources are provisioned versus how they're consumed by a user, like persistent volumes, where a user should just be able to say, I need a 25 gig volume here, and shouldn't care whether that's an EBS volume, a GCE persistent disk, or NFS, there are currently some issues here. We're getting bit by an open issue where, if you set the --cloud-provider=gce kubelet flag, which is required to get any of the GCE-specific functionality like persistent volumes, and you're using a CNI plugin, it will still try to create routes in GCE, and then you'll have nodes come up that say they're ready but have no routes to them. The workaround right now is to not set a cloud provider if you're in GCE, which kind of doesn't work, because then we can't use persistent volumes, and then we can't use StatefulSets. So having to tell our customers at this point that you can use these over here, but you can't over there, is not good for now, but it's supposed to be fixed soon, and that would be awesome. And this is probably what's been the most challenging for us: our initial Kube clusters were 1.4.
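Here's a rough sketch of what that pre-termination step does, written as a small Python script for illustration. The real thing is a Scalr orchestration script, and the exact commands, the 120-second bound, and stopping calico-node via systemd are assumptions about one reasonable way to do it, not our exact implementation.

# Rough sketch of the node-termination orchestration idea, assuming kubectl
# and calicoctl are available on the worker. Illustrative only.
import socket
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def graceful_terminate():
    node = socket.gethostname()  # assumes the node name matches the hostname

    # cordon and evict pods, bounded so scale-down isn't blocked forever
    run(["kubectl", "drain", node, "--ignore-daemonsets", "--force",
         "--timeout=120s"])

    # remove the node object from the Kube API
    run(["kubectl", "delete", "node", node])

    # stop calico-node and remove this node from Calico so no stale routes remain
    run(["systemctl", "stop", "calico-node"])
    run(["calicoctl", "delete", "node", node])

if __name__ == "__main__":
    graceful_terminate()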
We released 1.5, skipped 1.6, released 1.7, planned on doing 1.8, and now we're looking at 1.9, probably in Q1 for us. If we were just a team working on only this, we'd probably be able to do it, but we have a whole other platform to manage, and we're also integrating features into our Kube clusters, like our Vault integration and our Consul integration. So we're trying to get better at this, and we're looking at some alternatives that can help us out with the pace of development here. So how did we do? Well, I'm still getting paid, and they let me come and talk about this, so I imagine we did pretty good. Election night 2016 was a complete success for us. It broke all our traffic records. About a week before the election, we went to our CTO and said we wanted to run 100% on Kube, and he had the faith and the confidence in us to do that. And I think at one point around midnight, one of the API layers behind the application was having some issues, so they just threw it in a container, and we deployed it on the Kube cluster and had no problems. We managed over 170 deployments that night, all successful, which we couldn't have done if we were running this on the old VMware. So it was a long night for a lot of reasons, but it was a rewarding finish to a very challenging and very transformative project for us. And I just wanna bring this up real quick, because this is a little bit more recent. Our success team recently took the desktop sites for all of our properties and moved them into a Kubernetes cluster, and that reduced our daily operating costs by several hundred bucks, and they're looking at further optimizations there. Our deployment times for getting all 140 sites out, fully optimized, went from over two hours down to 25 minutes, and we could probably get even faster there. And during the recent tragedies of Hurricane Harvey in the Houston area and Hurricane Irma in Florida, we dropped our paywall for all of the local sites in those areas, so all the residents could get the most up-to-date and accurate information they needed to stay safe and informed. We ended up making 1,500 deployments during that time period, which, again, we could not have done the old way. So that was a pretty big win for us. And real quick, what's next for us? We're eventually gonna finish our move to etcd3 and Kube 1.9, like I talked about. We're taking another look at kubeadm. When we first started this, it wasn't really ready, and now it looks like it will help streamline how we handle our control plane with those gross RPMs. Vault integration is actually live now. We're using the Kubernetes Vault auth backend in conjunction with a pre-configured role and policy for that team, and then a role for that cluster. If any deployment needs to access secrets in Vault, you just add a Vault init container to that deployment, and it'll mount a token with a very short TTL in a shared secrets volume for any of the containers in that pod to access.
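To show roughly what that init container does, here's a minimal sketch assuming the Kubernetes auth backend is mounted at Vault's default auth/kubernetes path. The role name, environment variables, and file paths are illustrative assumptions, not our actual configuration.

# Sketch of a Vault init container: trade the pod's service account JWT for a
# short-TTL Vault token and drop it on a shared volume for the app containers.
# Paths, env vars, and the role name are hypothetical.
import json
import os
import urllib.request

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com")
VAULT_ROLE = os.environ.get("VAULT_ROLE", "my-team-cluster")  # hypothetical role
JWT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
SHARED_TOKEN_PATH = "/secrets/vault-token"  # shared volume mounted in the pod

def fetch_vault_token():
    # the pod's service account JWT proves its identity to Vault
    with open(JWT_PATH) as f:
        jwt = f.read()

    req = urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/kubernetes/login",
        data=json.dumps({"role": VAULT_ROLE, "jwt": jwt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        auth = json.load(resp)["auth"]

    # write the short-TTL client token where the other containers can read it
    with open(SHARED_TOKEN_PATH, "w") as f:
        f.write(auth["client_token"])

if __name__ == "__main__":
    fetch_vault_token()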
We started looking at service mesh. We think it's really cool, we really wanna do awesome things with it, but nobody's really asking for it, so it's taking a backseat to some higher priority stuff. And last but not least, the elephant in the room is our terrible (not terrible, but terribly over-orchestrated) HAProxy-per-application ingress load balancer solution. We're looking at some cloud native stuff, like using ALBs in Amazon or GCE load balancers, or maybe something like Envoy, but that's where we're at. So thank you for listening to me. If you have any questions, feel free to ask, or find me or Dane. We're also hiring, so if you wanna do cool, interesting things in the news industry, come talk to us. And that's it, thanks.