Hey, thanks for coming to my talk. I tried to go for the longest-title award this year, I think: "CNET and Friends: Using the CNCF Landscape to Run High-Traffic, Dynamic, Scalable, Cost-Effective Websites." I just went for all the great buzzwords, right? My name's Corey McGalyard. I'm an engineering manager for a company called Red Ventures, probably the largest media company you've never heard of. We own CNET, but also quite a few other large websites; you can check out the website later to learn about us. I've been with CNET specifically for about three and a half years now, and CNET's been around for a while. It started in 1994, originally as a TV network. A lot of people don't realize that we started with a focus on TV and shifted to a web presence afterwards. We see about 45 billion requests monthly at the edge and about 10 billion monthly at origin, which is probably the metric we're more comfortable with. That works out to roughly 20,000 requests a second at the edge, and somewhere from 5,000 to 10,000 a second, depending on the metric you look at, down at the actual containers at origin.

CNET has a rather long history of acquisition. Starting in 1994 it gobbled up brands, and then it was purchased by CBS Interactive in 2008. When I joined CBS Interactive in 2019, it was merging with Viacom, and then we were sold to Red Ventures a year after that. We also have quite a bit of technology leadership over the years. Because CNET has been around and building things for so long, we actually developed Solr, the search index, which was donated to Apache. We were also active with MooTools, and we adopted Docker, and Docker Swarm, really early; I've seen references back to the 2015-2016 era when that started to take shape. The links here mostly point to talks or documents about that history, and we've been pretty open about our Google Cloud migration over the last three years.

So, why Kubernetes? This talk is really about why we made the decision as an organization to move from Docker Swarm to Kubernetes; we started that process about two and a half to three years ago. The first few reasons I think everybody will agree with: it's a super flexible platform with a huge ecosystem of tools, and it has really high industry adoption. We were already comfortable with containers, having used them for years, so the step into the Kubernetes world was a really easy one for us to take. What really sold it for us, though, is managed Kubernetes. We're in Google, so we get what I consider the cream of the crop of the managed Kubernetes world, and it gives us visibility and control; I'll share the difference between living in a Docker Swarm world and living in a managed Kubernetes world today. Additionally, and I'll probably get some groans from you here, we like managed Kubernetes for its click-ops ability, and I'll share why in a bit. And then, as we first started moving into GKE, the GKE team released a flavor of Kubernetes called GKE Autopilot. The day it was announced, I spun up a cluster and started throwing some of our workloads at it to see if it would support them, and it was something we were really, really interested in, primarily because it lets us treat the Kubernetes API as a service, right? We can just throw a container at it and not worry about the nodes or managing the control plane or anything like that.
It really just gives me the API capabilities I like without adding complexity to our general workloads, and we can still interact with the CNCF ecosystem, which is probably why we're all here, right? And the other reason for Kubernetes: developer and customer experience. Our engineers aren't infrastructure engineers. They just want to write Node code, submit containers to an API, and get a really consistent response back, but you'll also see how flexible we've made the setup over the last couple of years.

This is what Docker Swarm looks like from a node perspective, if anybody's never used it. The barrier to entry for interacting with Swarm meant that an engineer had to SSH onto a manager node and run commands against it. You could get really good information that way: the containers that were running, the images they ran, how many replicas, and so on. It was visible, but our general day-to-day engineers aren't going to want to do that. Moving to Google, we were able to give them a really nice dashboard, free, out of the box. We can see the containers (this is actually the front end of CNET, by the way), along with resource utilization and requests. We're over-provisioned pretty heavily in prod because we want to make sure you all get a really good experience when you hit our site. You can also get to logs really easily (I can't see my pointer, but it's up there), and inside the containers themselves you can see not only the entire deployment's logs but each container's logs with the click of a button.

Additionally, you have the ability to modify your deployment really easily. We don't have permissions to production by default; we have to request them, and even I, as a platform engineer, have to do that. But this gives our engineers a really easy way to make the same request and adjust scaling around scaling events. If we're heading into a Black Friday or Cyber Monday, we can prepare for it, and if an iPhone suddenly gets announced and we're getting more traffic than we anticipated, we can very easily make a change and put that control in our engineers' hands. That's really the point behind why we chose to go with GKE specifically.

This is our CNCF tool chain, plus a couple of other open source and internal tools I'd like to talk about today; we'll get into the details in a minute. So let's put that together. It starts with a Slack message: hey, can you take a look at this ticket? We go to the ticket, and it might not be the ticket we want, but it's the one we deserve. We validate it; it's an April Fools' joke, make the site business Comic Sans, right? We make a commit and push it up to Git. We're a Jenkins shop, which is probably my second groan moment for now. Based on the branch and the sandbox name, we can request an ephemeral environment, a sandbox, and build it out. That gives our engineers the capability to very quickly compare the front door of CNET to their change, and it takes all of five minutes. What's nice is that this isn't running locally: it's running in Kubernetes inside Google Cloud, so I can give that link to a product manager, who can then validate that what we did is what they wanted.
So what makes all this possible? The first things are the Kubernetes objects. We should all be pretty familiar with those, so we'll walk through them quickly, just to share how we structure things. We have a deployment that spins up pods; in this situation I just want one Node.js pod next to an Nginx pod, which is pretty common practice, right? Then, sitting in front of those pods, we have services. The oddly named "node" service is an artifact of our having used Docker Swarm and Docker Compose and of how you name services in Docker; that's why it's named that way. And then the ingress allows traffic to come into the cluster and flow through the services to the pods. Drawing-wise, it works like this: the request comes in at the ingress object and is passed to the service on that ingress object; Nginx is configured to look at the "node" service and pass the request down; and Nginx and Node can each scale up and down as we need.

This is our typical, simplified ingress definition. I'm going to walk through all of those annotations; that, in my opinion, is really the power of the CNCF world, how quickly we can pull in different tools and augment the native Kubernetes APIs. And then there are config maps and secrets. This slide shows a secret for a TLS certificate; you can see where the cert and key would typically go, though I shortened them to fit on the screen. A secret typically sits encoded on disk rather than in plain text, and it all lives inside etcd, right? So that's how Kubernetes does its thing.

These are the internal, open source, and CNCF tools we're going to walk through, and the first one is Helm. Helm for us is obviously a package manager: anything we run inside our clusters gets bundled in a chart, which helps us move those deployments into the cluster. We like it for the deployment history; we have jobs that let us look at the history and roll back. What I appreciate most about Helm is that it doesn't just capture the container image change, it captures anything that changes in the manifest itself. So if I make an ingress change and break it and have to come back, it's really easy to pull back that config change. It also gives us flexibility in the manifests themselves. How many people are sed-ing out image tags today? That's been a pattern in the past: you'd set up a deployment and change the image tag with a sed command. Helm gives us the ability to pass values down into our manifests and make them really, really flexible, and it lets us run multiple environments off of one chart. It also gives us logic: gates and loops. In some situations we want our sites public to the internet, and in some situations we don't, and having a true/false flag makes that really easy to set up. And looping: we have multiple host names all over the place, because of the different paths we like to take.
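There isn't an easy way to demo Helm live, but as a rough illustration of those gates and loops, a templated ingress might look something like the sketch below. This is hypothetical, not CNET's actual chart; every name and value here is made up, and the annotations anticipate the Traefik, external-dns, and cert-manager pieces covered next.

```yaml
# templates/ingress.yaml -- hypothetical sketch; names and values are
# illustrative, not the real chart from the talk
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Release.Name }}
  annotations:
    kubernetes.io/ingress.class: traefik
    cert-manager.io/cluster-issuer: letsencrypt-dns
    {{- if .Values.external }}
    # gate: only publish public DNS when this environment is external
    external-dns.alpha.kubernetes.io/target: {{ .Values.lbHostname }}
    {{- end }}
spec:
  tls:
    - hosts:
        {{- range .Values.hostnames }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .Release.Name }}-tls
  rules:
    {{- range .Values.hostnames }}
    # loop: one routing rule per host name
    - host: {{ . | quote }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx   # which in turn proxies to the "node" service
                port:
                  number: 80
    {{- end }}
```

```yaml
# values.yaml for one environment (again, hypothetical)
external: true
lbHostname: traefik-lb.example.com
hostnames:
  - sandbox-1234.example.com
```

Installed with external set to false, the DNS annotation disappears entirely, and the same chart serves an internal-only environment.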
The Traefik ingress controller is something we've used since we were on Docker Swarm. It was really, really flexible and easy for us to use there, and we brought it over into Kubernetes because it was familiar; it was one less step we had to take on the way into the Kubernetes world. We use it much the way Nginx is often used, with host- and path-based routing: once a request gets to the Traefik load balancer, based on the host that's passed, we can route it down to the service that's specified. It also supports cert-manager, external-dns, and GCP load balancing.

You can see how that works here. This is the same ingress I showed you a minute ago, with our friendly red boxes. The ingress class of "traefik" identifies that this ingress is associated with Traefik, so Traefik starts listening and paying attention to it, and based on the host name that's set, Traefik, the ingress controller, routes traffic to the service listed on the ingress. (Traefik has always been hard to talk about, differentiating the name of the tool from actual internet traffic, which is not fun.) You can also augment Traefik with annotations, and I'll call out that Traefik has its own CRDs that do some similar things. We chose to use the Ingress object so that we keep some flexibility: if we ever need to change our ingress controller, we can. Also note the TLS section; that's going to be important when we talk about cert-manager. Traefik is aware of the TLS information and the secret name I'm setting there, and when a valid cert is attached, Traefik is the piece that terminates TLS.

So now we have the application running, and we've got our cluster and our load balancer listening for the host name. Technically, with a host header, we could hit the load balancer and get there. The problem is we don't have DNS, and in the past that would have been a support ticket. external-dns lets us set up a tool that automates DNS creation for us. This is the configuration on the external-dns service itself: I tell it my cloud provider, and I tell it the GCP project I'm in, and that's what allows the external-dns pod to make DNS changes. But external-dns comes up in the cluster without permissions against Google, so we have to authorize it somehow. GKE has a feature called Workload Identity (AWS has a very similar mechanism) that allows us to bind a Kubernetes service account to a Google service account, so that my external-dns pod can inject DNS records as needed, because it's authorized to do that inside Google Cloud.

Then, based on the same ingress object I created earlier: you can use external-dns a couple of ways, and we use it to point a host name at a load balancer. The host in the ingress spec gets a CNAME created for it, pointed at the target listed in the annotation. That takes approximately a minute and a half, at most, to propagate; the default TTL says five minutes, but in practice it's really, really fast. And so now we can get traffic to my containers, to my workload.
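Roughly, that wiring looks like the following; a minimal sketch with a placeholder project and Google service account, not the exact manifests from the talk.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  annotations:
    # Workload Identity: bind this Kubernetes service account to a
    # Google service account that is allowed to write Cloud DNS records
    iam.gke.io/gcp-service-account: external-dns@my-project.iam.gserviceaccount.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  selector:
    matchLabels: {app: external-dns}
  template:
    metadata:
      labels: {app: external-dns}
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.14.0
          args:
            - --source=ingress        # watch Ingress objects for hosts
            - --provider=google       # the cloud provider setting he mentions
            - --google-project=my-project
```

The external-dns.alpha.kubernetes.io/target annotation in the earlier ingress sketch is the other half: it tells external-dns which load balancer host name the CNAMEs should point at.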
We can see our beautiful Comic Sans, and that gets us there. The next problem we have to deal with is certificate management. In the past, that was always a support ticket: going out to some provider, filling out a massive form, getting an email, and stuffing the result into a random secret somewhere. In the Swarm days, we had to manually put the cert into a secret and then update our workload to pull it into the actual application. cert-manager automates all of that. I went to their booth earlier this week and thanked those folks, because I save about a month and a half of dev time annually just by automating cert rotation. I love Kubernetes, and cert-manager may be my favorite tool in the tool chain. No joke.

So cert-manager works very similarly. You have some containers running in your cluster, and you create a certificate issuer, basically an issuer that points at a CA. This one points at Let's Encrypt and does DNS-01 challenge authorization. Why is that important for us? You can do HTTP validation, where your container comes up and proves from the inside that, yes, you own that domain. But we do a lot of internal work, because we don't want to surface a dev version of the CNET front end, have Google scrape it, and kill our SEO scores. So we need to be able to validate these certs internally, so that we get top-to-bottom consistency between our development and production environments. We set up the cert issuer, and then we do the same thing for cert-manager that we did for external-dns: tie its Kubernetes service account to a Google service account, allowing cert-manager to do the same kind of DNS writing that external-dns does.

Then, on the ingress, we specify the issuer we want to use, and we also specify the TLS information. The hosts under the TLS section are the SANs the cert is provisioned for, and the secret name is the name of the Kubernetes secret the certificate gets stuffed into. Let me walk through the whole process. A request for a certificate gets created and sent to Let's Encrypt through cert-manager. cert-manager creates an ACME DNS-01 TXT record from the Let's Encrypt response, Let's Encrypt says, yes, you own that, here's the certificate, and cert-manager stuffs it into that secret. There's also a CRD with cert-manager, so you can watch it happen in real time and see the status of the certs. Once that all happens, the cert is actually inside the secret. What's really beautiful about this, for those of us who care about mTLS or certificate validation at all, is that not only can my ingress use it, so can any other object inside Kubernetes, and my pods can mount the certs as actual files. If I need to give Nginx a valid certificate, I can create one and hand it over, and the same goes for Envoy proxy. In this situation, we give it to both Nginx and Traefik.
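A minimal sketch of that issuer, assuming Let's Encrypt with a Cloud DNS DNS-01 solver; the project and email are placeholders.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key  # ACME account key cert-manager stores
    solvers:
      - dns01:
          cloudDNS:
            # cert-manager's service account is bound, via Workload
            # Identity, to a Google service account that can write the
            # challenge TXT records in this project
            project: my-project
```

The cert-manager.io/cluster-issuer annotation and the tls block in the earlier ingress sketch are what trigger issuance: cert-manager sees them, runs the DNS-01 dance described above, and drops the resulting certificate into the named secret.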
So all of that lets us put Comic Sans on a temporary environment, right? That's a lot of work to get Comic Sans out there. The next thing that happens is the engineer goes to the product manager and says, hey, look, I did my job, and once they get the approval they merge the pull request. And then we forget about this environment, because we're terrible at cleaning up after ourselves. Everybody agree? So we wrote a tool internally called Deccan. This is a process we want to clean up and automate a little better, but today, based on our CI/CD process, we get to say how long those containers run, and we set a date. We create a config map with a special label on it that specifically says: you can delete this thing, and here's the date and time I'm okay with you deleting it. Then a cron job runs hourly to validate when and whether each namespace can be deleted. Once a namespace hits its purge date, it gets cleaned up, and we've tidied up after ourselves. So that's really how our ephemeral environments work; I'll talk about the move to prod in a second.

In all of our environments we have very similar observability. We run Prometheus, a time series database that lets us scrape all the containers in our cluster as well as resources outside the cluster; we actually use it to monitor the uptime of GCE instances too, not just things inside the cluster. All of that data can be visualized with a tool called Grafana. This is one of my favorite dashboards: it shows latency, throughput, and error rates, all from the Traefik ingress controller metrics, and for the same workload, just one workload, I can see how saturated CPU and memory are. If you're familiar with the SRE principle of the golden signals, these are the four; the bottom tier is really saturation, split into memory and CPU. All of that information can then be surfaced to a tool called Alertmanager, which lets us pay attention to how well our services are running. We can write alerts that say: if we see a delta on the error rate, say a 50% error rate increase, or a 10% increase, page the person who owns this service; and the same goes for memory or whatever signal you care about. Additionally, alongside all of this, we're evaluating our cloud provider's observability tooling; this is the same dashboard reimagined with Google's dashboarding. The cool thing about Kubernetes plus your cloud provider's native tooling is that we have crazy visibility into our workloads today. We can see when things are going well and when they aren't, and we can be alerted on them as necessary.
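As a sketch of what one of those error-rate pages might look like, assuming the Prometheus Operator's PrometheusRule CRD and Traefik v2's metric names (both assumptions; the talk doesn't show the actual rules):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frontend-error-rate
spec:
  groups:
    - name: frontend.rules
      rules:
        - alert: FrontendErrorRateHigh
          # page when more than 10% of requests over 5 minutes are 5xx
          expr: |
            sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
              /
            sum(rate(traefik_service_requests_total[5m])) > 0.10
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Front-end 5xx rate is above 10%"
```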
Okay, so I've talked about a sandbox environment, and that's probably not what you want to hear; you want to hear what we do for production. We do the exact same thing, all the way top to bottom: development, sandboxing, production. The only things that change are these. First, how we handle internal versus external traffic routing, whether it comes through an internal load balancer or an external one; that's usually handled with annotations on the ingress and logic gates in Helm. Second, how we handle replica counts: we don't run one pod in prod, we run close to a couple hundred. Third, we open up resource requests and limits in production. We deliberately over-commit, because we want to make sure we're not throttling the application; good user experience is more important to us than being really, really tight on resource utilization. And then, workload separation. This is new for Autopilot; William Denniss talked about it at the last KubeCon. You can essentially create ephemeral node groups based on a label you put on your workload, so that, say, your back-end workload and your front-end workload are kept apart and you don't have noisy-neighbor problems. And beyond that, if you've never read about pod anti-affinity, there are ways, using labels, to tell the scheduler to separate, or try to separate, pods of the same workload across multiple nodes, so you have true high availability. Having 50 pods on one node is not high availability; it's not the same thing, right? We want 50 pods on 10 nodes, so we're five deep across the cluster.
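Putting both ideas in one place: a hypothetical sketch, with made-up labels, of Autopilot-style workload separation plus a spread constraint that keeps replicas from piling onto one node.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 50
  selector:
    matchLabels: {app: frontend}
  template:
    metadata:
      labels: {app: frontend}
    spec:
      # workload separation: on Autopilot, this selector/toleration pair
      # gets the workload its own nodes ("group" is a made-up label)
      nodeSelector:
        group: frontend
      tolerations:
        - key: group
          operator: Equal
          value: frontend
          effect: NoSchedule
      # spread: aim for 50 pods across ~10 nodes, not 50 on one
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: frontend}
      containers:
        - name: nginx
          image: nginx:1.25
```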
So this is all great, right? We started this journey three years ago, and this is where we want to go; we're literally one API away from being completely in Kubernetes. Our next steps are to look at Google's managed Prometheus, primarily because we want longer-term metrics; right now we're limited to about 30 days on disk. Then moving from the Ingress API to the Gateway API; if you're in this world, you're probably thinking about that to some degree, because Gateway API seems like the next step forward. Policy management with Kyverno: admission controllers, the ability to put guardrails up. The more permissions we give our engineers to put stuff into Kubernetes, the more we want to make sure they're making really smart decisions, like not running as root, for example. Also mutating webhooks: we like to label things for information, and with mutating webhooks we've found ways to force certain labels onto workloads so that we can follow billing, cost, and security management, and things like that. And obviously supply chain hardening and improved observability; as you saw from the observability slide, we're actively looking at how to do that in a much, much better way. And then my big one right now is GitOps and a pull mentality. We've been around for a long time, and the push mentality has been around for a long time, right? This GitOps idea of pulling changes into your cluster, into your environment, is new for us, and it's something we're investigating; we're looking at Argo and other tools as well.

Okay, so that's how CNET runs our environment, or rather, how we run it today. CNET's been around for twenty-something years, almost as old as I am, which is kind of funny and fun to work on, to a degree. Before we adopted containerization, we could deploy once, maybe twice per week. The process was slow and tedious: we had release engineers, we had to do a lot of coordination, and we had to think hard about how to take a one-line change and march it through to our production environment. Today, and I don't have a really good metric because it varies so much, we deploy anywhere between 10 and 20 times a day to production, and much more than that in non-production. We can make a change and get it to the front door of CNET in hours. That's honestly the big value proposition for all of this tooling. We could consume a managed service and probably run some of our sites on it, but the flexibility and the speed our engineers have to meet the business requirements laid out for us is why we're here. That's why I like it, and also, yeah, I've been a huge Kubernetes fan for years and was brought in to do this, so it's fun. And that's how we do it. If anybody has any questions, I'm more than happy to answer as much as I can. Sorry for the vast amount of YAML and the speed we went through it at. Yeah, let me grab a mic so everybody can hear you; it's a big room.

Q: Just curious how you do test automation and sanity checks when you're pushing 10 to 20 times a day.
A: All of that is handled by our CI/CD system. This talk was really focused on getting the deployment change out, so the CD, less the CI and the integration-validation pieces. To be open, I'm not on that team, so I'm trying to remember the tool names: there's a Node framework they use for the Node applications, and we use PHP and Symfony quite a bit, with some frameworks around that too. I know those tests pass and fail pretty regularly, because I've seen them, but it's not part of my purview. Another question?

Q: You said you're exploring your CD strategy with Argo CD and some other tools. Can you shed some light on how you're going about it?
A: At the moment, to be honest, we use Jenkins today, and it's been around for a while, so it's going to be a slow migration out of it for us, most likely. We've played a lot with Cloud Build, Cloud Deploy, and Argo CD; those are the three things we're exploring. We'll probably move our integration pieces to Cloud Build, because it's native to what we're doing and it ties into GitHub, but the deployment process is still up in the air. That's really going to be us using both for a while and deciding which one we like better.
Q: Tekton?
A: No, but I'm actually really interested in Flux after watching a couple of demos this week, to be completely open. I think it's a cool tool, yeah.

Q: Do you have a single production cluster or multiple production clusters?
A: When I said "and friends," right, there's more than just CNET that I manage. We have 30 clusters in prod and 30 clusters in non-prod today.
Q: So do you then have a single Prometheus server per cluster that you have to switch between?
A: Yes. Our Grafana is gross; we flip between the different Prometheus instances. Yep, there's one over here.

Q: You mentioned you're using Alertmanager. Do you use it for all of your alerting, or do you have several systems?
A: Well, okay, I take that back. In the past it was our only alerting tool for the operations team. We've since been investigating the alerting built into Google's infrastructure, as well as the alerting that's part of Grafana, to a degree.
Q: And you mentioned GKE managed Prometheus; it's also compatible with exporting and importing all kinds of Prometheus metrics to other Prometheus instances.
A: Sure.
Q: Would you consider using Alertmanager at scale for everything, or would you rather move the other way, taking stuff out of Alertmanager and moving it into a managed system?
A: My team is small; I think there are six of us now. So we will consume a managed service quicker than an open source one if it meets our needs. It depends. Blackbox Exporter not being available right now with managed Prometheus is going to hurt, and Autopilot doesn't support managed Prometheus at the moment either, so it's going to be a minute before we make that shift. But it is something we're exploring. Does that make sense? He's got a mic right behind you.

Q: You mentioned sandbox testing, which obviously you don't want to expose to the public. How do you control access for the product manager who you want to see those changes?
A: We're a legacy company, right? So we have a VPN; it's truly internal. But IAP, if you haven't read about it, is a Google service that lets you expose things externally while validating a Google account before anyone gets in, and that's one way we're exploring moving away from that. We also make use of headers: we can do some header magic, and we can flip that internal/external flag with a true/false. So if we need something to be truly external and validated through the CDN, we can have a test CDN service and actually validate the entire stack. So it depends. Sorry. Any other questions?

Q: Hi, thanks for your talk. I was just curious how you're addressing security needs, or where that falls in your priorities.
A: In what way?
Q: Just, I mean, securing the application overall.
A: So for things like container scanning and paying attention to GCP, we have a separate security team, and we use a tool called Wiz; that's the tool our organization has chosen. We also use the Cloud Armor WAF. We had another tool recently that we just migrated off of, whose name escapes me. Again, we'll take our time, especially if something aligns with our cloud provider. I'm happy to talk after, but thank you all so much. This has been a lot of fun. This is my first KubeCon talk, if you couldn't tell that I was really nervous. Yeah.