Let me say it's a huge honor. And let me first thank the Academy. Wrong stage. All right. My name is Ilya. This is Z. And today we'll talk to you about the HBO journey to Kubernetes: a journey which began not that long ago, from not having a single service running inside a container to hosting Game of Thrones season 7 on Kubernetes. Z and I will split this talk. I'll tell you about the why, the reasons, and Z will cover how we got it done. OK. It's downhill from there.
HBO is home to the shows that everyone is talking about, from groundbreaking series to documentaries, sports, and the biggest blockbuster movies available anywhere. For over 40 years, people who love entertainment have recognized HBO as the original, the first, and the best place to find the world's most innovative programming. HBO Digital Products is represented by HBO Go, which is part of your TV programming subscription through cable, satellite, or other providers, and HBO Now, a subscription directly with HBO. Both provide unlimited access to HBO programming on just about any device. In a broad sense, Digital Products is everything and anything to do with content streaming. And Digital Products is where Z and I work, on the platform team.
So if we look at the HBO streaming services, this is how they could look from a mile-high view. This is not the actual image, but the resemblance is pretty close. And if we zoom in, they could best be described as a mesh of API services written mostly in Node.js. We have also been adding more and more services written in Go. All services were deployed into EC2 in a single-service-per-EC2-instance paradigm, all wrapped with auto scaling groups, which handled both deployment and scaling, all fronted with internal or external load balancers, and routed using DNS names. Overall, it was and still is a tried and true setup for running services on AWS. It works for the general case; however, as you will see next, the HBO case is anything but general.
The HBO traffic pattern can best be described as the wall. This is just a random example of how fast playbacks are started on Sunday night during a Game of Thrones season premiere, at around 6 PM, known as prime time. And I think this point in time can be best represented by this image in terms of what our API services are faced with, as well as the emotional state of our engineers. So looking at the Game of Thrones traffic pattern, episode after episode, season after season, left us with very unsettling feelings about our future. Can we hold on during the next episode? What about next season? And the answers were less than optimistic, because we were running into multiple problems.
Chief among them was underutilization. Running Node.js implies that a service process can utilize a single CPU core at most. Now, since we were deploying on EC2 instances, and single-core EC2 instances almost always do not have good network performance, to find a good balance between network and CPU we had to select instances with at least two CPU cores. So right there, we would not be utilizing 50% of our CPUs across all deployments. Auto scaling is good; however, it is slow in comparison, and sometimes it is inadequately slow to react to spiky traffic. Thus, we had to overprovision our deployments, doubling, tripling, or sometimes more, to accommodate unpredictable traffic patterns. So you take the initial 50% underutilization, buffer it up with overscaling, and multiply by the number of regions. That is a lot of unused CPUs.
As for ELBs, every service communication requires an ELB. Again, it's the tried and true ELB plus ASG approach. However, even for internal communication within the same VPC, we were required to stand up an ELB for each service. That resulted in a lot of ELBs, which led us to the next problem: limits, otherwise known as things that we kept running out of.
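To make the CPU arithmetic described above concrete, here is a minimal sketch. Every number in it (instance count, overprovisioning factor, region count) is a hypothetical placeholder, not an HBO figure. Only the structure of the calculation follows the talk: at most one busy core per two-core instance, multiplied by an overprovisioning factor and by the number of regions.

```python
def idle_cores(needed_instances, cores_per_instance=2, usable_cores_per_instance=1,
               overprovision_factor=3, regions=2):
    """Return (total_cores, idle_cores) for a fleet sized as described in the talk.

    All default values are illustrative assumptions.
    """
    # Instances actually running after overprovisioning, summed over all regions.
    running = needed_instances * overprovision_factor * regions
    total = running * cores_per_instance
    # Cores doing useful work at the load the fleet was originally sized for:
    # a single Node.js process keeps at most one core busy per needed instance.
    busy = needed_instances * usable_cores_per_instance * regions
    return total, total - busy


total, idle = idle_cores(needed_instances=100)
print(f"{total} cores provisioned, {idle} idle ({idle / total:.0%})")
# 1200 cores provisioned, 1000 idle (83%)
```

Even with modest assumed factors, the idle share grows well past the initial 50%, which is the effect the talk is pointing at.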
It's ironic, because we were underutilizing more than 50% of our CPUs, and yet we were running out of all other resources. To keep up with resource usage, we had dedicated alerts which would fire every time we crossed an 80% utilization threshold on a given resource: ELBs, ASGs, security groups. And when we got notified, we would contact AWS support to increase our quota limits. Of course, things got more interesting when we began to run out of instances for given instance types, or IP addresses for given subnets, when even AWS could not help us with our problems. Then, of course, there are external resources that are directly tied to instance count, like a telemetry provider who would calculate license usage based on the number of instances.
So this brings us to Kubernetes. And rather than going through these bullet points (they're all true), I will tell you my personal story. In fact, this is my second time being on stage at KubeCon. The first time was in San Francisco in 2015, when I was summoned to the stage, just like today, by Kelsey Hightower, to my great surprise and horror, because I was not supposed to present. What happened was that, walking through the hallways of KubeCon 2015, I dropped my wallet somewhere. Someone had found it and returned it to the event organizers, and he called me to the stage to return my wallet. So I cannot think of a better example of Kubernetes saving the day, keeping services up and running and preventing outages. Otherwise, it would have been very interesting to fly home.
So I was settled on Kubernetes from the get-go. But we did do our due diligence. We looked at Mesos. We looked at Swarm. We looked at ECS. And for us, Kubernetes was and still is the clear winner. We were just at the beginning of our journey, and we had a very tight schedule, given that we had to containerize all the services first. And containerization at mass scale is a huge undertaking on its own.
So once we began conquering that hill, we started playing with Kubernetes on AWS. Again, this was the end of 2015; a lot has changed since then. We started with the most basic setup available, running kube-up to deploy Kubernetes into our existing VPCs. We had to tweak it and make some configuration changes. What we needed to do was show our peers, our bosses, and ourselves that Kubernetes is not vaporware, and, more importantly, that Kubernetes can be operated in AWS to host production-grade services. We started cutting our teeth on a Jenkins infrastructure cluster. Once we were successful with that, we began to provision a home cluster for our streaming services. And that's when we realized that the basic setup simply wouldn't cut it, and we had more work to do. Z will tell you how we got it done.
Okay, I'm going to tell you some things we learned from our Kubernetes journey. We created our own Terraform templates for provisioning our clusters. We started before some of the community projects did, for example kube-aws, kops, or Kubespray. This allowed us to do some really cool things. For example, we can deploy our clusters into our existing AWS infrastructure by providing our VPC IDs, subnet IDs, and security group IDs. We also had high availability in mind, so from the very beginning our minion and master ASGs were multi-AZ. The purposes of the ASGs are different, though. For masters, we want to maintain a fixed number of nodes, so if one fails, AWS will automatically launch a new master. For minions, we want to scale up and down very fast, so we use the ASG to launch and terminate nodes. The masters also run in HA mode, meaning that the API servers are load-balanced, and the schedulers and kube-controller-managers run as leader and followers. Despite being homegrown, we keep incorporating best practices from the community.
We turned on OIDC, or OpenID Connect, authentication for the kube-apiserver. So as long as our developers' GitHub accounts are in the HBO organization, they will get a token for their kubectl authentication.
Terraform modules are a great way to promote reusability and modularity. We created self-contained Terraform modules for both Kubernetes masters and minions. When we want to launch a cluster, we compose a Terraform template like this, run terraform apply, and, bam, we have a cluster up and running.
Several weeks later, we had some experience of operating a cluster, and we noticed a few problems. First, we run Prometheus in the cluster to scrape metrics, with a provisioned IOPS EBS volume as data storage. Because our cluster scales up and down all the time, sometimes the minion that hosts the Prometheus pod gets terminated. We then have to wait a long time for the Prometheus pod to come back, because Kubernetes has to detach and reattach the EBS volume. That process could be very slow, and during that time we were losing metrics. The second problem is that for big events like a Game of Thrones premiere or finale, or simply our regular load testing, we have to pre-scale our cluster up significantly to overcome the wall effect Ilya just mentioned. And sometimes AWS cannot provide sufficient capacity of our desired instance type for the minions.
These issues led us to an improved version of our Terraform modules. First, we added an instance type variable to our minion module, so that all the minions launched from that ASG get that particular instance type. We also added a taint variable to the minion module and passed it as a kubelet startup flag. So if this variable is defined, all the minions launched from that ASG will register with that particular taint. Again, we benefit from the modularity of our Terraform code. For regular minions, we pass our main instance type to the module. We added another module to our cluster.
We call it backup minions. These minions are exactly the same as our regular minions, except that they run on our backup instance type, which is c4.8xlarge. Another module we added is what we call reserved minions. Reserved minions are, again, exactly the same as regular minions, except that they are tainted with the taint reserved=true. At the same time, we updated our cluster autoscaler so that when the cluster scales down, the reserved minions are excluded. So to summarize, we added two new minion ASGs to our cluster to address the issues we had earlier. First, if AWS runs out of the main instance type we want, we scale up our backup minions to bring in more capacity. And second, we updated the Prometheus deployment to have affinity to the reserved minions and also to tolerate the reserved taint. This way, the Prometheus pod is not interrupted even when the cluster scales down.
Flannel was the networking layer we chose at the beginning. Compared to other solutions, it was simple to set up, especially before CNI came along, and it is included in every CoreOS distro, the distro that we use. However, when we were doing our regular load tests, we discovered that under heavy load there were some problems. First, there were increased latency and timeouts, both between pods and going out of the cluster. And second, there were UDP packet drops, which impacted our kube-dns lookups and custom metric collection, both of which are UDP traffic. These are just two issues that we saw very frequently during the season. The GitHub links are on the slide.
Now let's talk about the different types of services that we tried and the different ways we used to get traffic into our cluster. First, we provisioned ELBs for every NodePort type of service and associated these ELBs with the minion ASGs. This way, all the minions launched from the ASGs are registered with the ELBs automatically. However, there's an AWS hard limit of 50 ELBs per ASG.
And also, since we are provisioning the ELBs ourselves, we have to keep track of them manually.
Next is Ingress, which I think is the most common setup out there. We put a shared ELB in front of the Ingress controllers; the ELB forwards traffic to the Ingress controllers, which then proxy traffic to the upstream services. However, there were some problems with that too. First, when we looked at the CloudWatch metrics for the shared ELB, we noticed some 500 errors, but which backend service exactly did those 500s come from? It's pretty hard to tell without crunching the Ingress controller logs or ELB access logs. Second, we noticed that the Ingress controllers seemed to struggle with the very bursty, spiky traffic that we saw, and this setup produced more connection timeout errors in our load tests than the NodePort setup on the previous slide. And finally, whether your shared Ingress ELB is public or internal determines whether all the services behind it are public or internal.
And last but not least, we tried the LoadBalancer service type, which is cloud-provider-specific. In this scenario, Kubernetes actually handles the provisioning of the ELBs and registers the minions with them. This method is not affected by the limit of 50 ELBs per ASG. But first, we noticed there was an AWS API throttling problem. And second, there were some ELB security group customization issues that didn't get resolved until the recent 1.8 release.
So in the end, these are our choices for services and Ingress. For production services, we use NodePort plus individual ELBs. For non-production services, we use Ingress controllers and shared ELBs. In both scenarios, we use the built-in service discovery for making calls between our microservices.
Kube-dns is always an interesting topic. Have you ever looked at the resolv.conf file in your pods? Basically, this is what it looks like. First, you've got a bunch of domain names to search: cluster-internal domain names and AWS internal domain names. This one is actually from a pod in the default namespace.
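A resolv.conf of the kind described here, and the lookups a resolver derives from it, can be sketched roughly as follows. The search domains, the nameserver IP, and the name api.example.com are made-up placeholders, not values from the talk. The ordering rule is the standard resolver behavior for the ndots option: a name with fewer dots than ndots is tried against each search domain first, and only then as an absolute name.

```python
# Hypothetical pod resolv.conf; all values are illustrative assumptions.
RESOLV_CONF = """\
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 10.3.0.10
options ndots:5
"""


def parse_resolv_conf(text):
    """Extract the search list, nameservers, and ndots value from resolv.conf text."""
    search, nameservers, ndots = [], [], 1  # the resolver default for ndots is 1
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith(("#", ";")):
            continue
        if parts[0] == "search":
            search = parts[1:]
        elif parts[0] == "nameserver":
            nameservers.append(parts[1])
        elif parts[0] == "options":
            for opt in parts[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, nameservers, ndots


def candidate_queries(name, search, ndots):
    """List the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is not applied
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


search, nameservers, ndots = parse_resolv_conf(RESOLV_CONF)
for query in candidate_queries("api.example.com", search, ndots):
    print(query)
```

With these assumed values, a lookup of an external name with two dots produces four search-domain queries that cannot succeed before the real one is sent.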
So you see default.svc.cluster.local there. Second, you've got the service IP of kube-dns. And finally, there's the ndots:5 option. This option basically means that there will be many invalid and unnecessary DNS lookups. This is what happens: for example, if you want to resolve a DNS name like pjsql.backend.sql.com, because it has fewer than five dots in it, the resolver will append all the search domains and try those first before the actual DNS query happens. Why ndots:5 was chosen is explained in detail in this ticket by Tim Hockin, for reasons like same-namespace lookups, cross-namespace lookups, and of course cluster federation. In the following slides, we will share some of the tunings we have done to reduce those invalid lookups.
Okay, these are some of the tunings that we found very important to us. First is the cache size of the dnsmasq container. I think the default is somewhere around 100 or 200, but you should set it to the max, which is 10,000, unless memory is really constrained in your system. Setting it to 10,000 will only cost you an additional couple of megabytes of memory. The --address flag is a really big performance booster. This flag tells dnsmasq to return an IP address for a specified domain name. However, we use it slightly differently. We specify a whole bunch of invalid domain names that were created by ndots:5, and we do not specify an IP address. So effectively, for these invalid domain name lookups, dnsmasq returns not found immediately instead of doing an actual lookup. This way, we speed things up a lot. So if you haven't taken a look at your kube-dns deployment, I recommend this flag. And finally, if you have some internal name servers you want to hook up, the --server flag is for you. All your pods will be able to resolve internal domain names without additional changes. And with that, I'll hand it back to Ilya.
Thanks. A quick word about telemetry. Remember that we didn't have any containerized services before.
So we could take almost nothing from our telemetry stack to Kubernetes, except Splunk, and for Splunk the telemetry team did some heavy customization and tuning of the Splunk forwarder to get reliable logs. Everything else on this slide was new technology to us, and I think that was a great thing. Z already mentioned the special case of reserved instances for stateful services like Prometheus, and we love Prometheus. However, running EBS across availability zones with no node affinity for Prometheus, and juggling it, is not fun at all. And speaking of EBS, it can have some interesting mount and unmount times. We evaluated Rook with great success; we just didn't risk putting it in production before the Game of Thrones season. However, we are excited to see Rook on its way to becoming a CNCF project. It's been submitted to the TOC.
As for cAdvisor, it's one thing to consume metrics from cAdvisor running in your infrastructure cluster with a few nodes, two CPU cores each, and a couple of Jenkins pods deployed. It's totally another to run it on a 300-node cluster with 40 cores each and more than 20,000 containers deployed. Consuming metrics at that scale felt like drinking from a fire hose, and we had to do some extensive metric tuning, filtering, and Prometheus memory adjustments to get metrics into a reliable state.
So when do you know you are ready, ready for a Game of Thrones season premiere? For us, it boiled down to setting the bar, in terms of threshold, viewership, and ramp-up speed, and beating it with a load test. So for about two or three months leading up to the Game of Thrones season premiere, we ran a weekly mega load test, and the first attempts were just beautiful. They left us in ruins, and that's when the real work began. It began on both fronts. On the services side, our service engineers did a heroic job investigating issues and fine-tuning services to accommodate the new environment. And on the Kubernetes side, we began to look into issues, reporting them if no existing reports were found and fixing what we could.
Slowly, we began to emerge in better shape, gaining more confidence in Kubernetes and the services running on it. If there is any moral to our story, after a successful Game of Thrones season seven on Kubernetes, it is that it feels good. It felt good to be right, perhaps for the first time, by making the right choice. And if there is any advice we can give, it is: trust yourself, trust your team, succeed at small things, and you'll be well positioned to succeed at big initiatives. It won't always be a smooth ride, but you and your systems will emerge in better shape than you went in. For us, many problems we found in our services were not caused by Kubernetes. They were there all along; Kubernetes made them more visible.
So as mentioned earlier, we looked at alternatives. However, the biggest, undeniable, most important reason why we chose Kubernetes was the vibrant and active Kubernetes community. Without all the GitHub issues and fixes, without SIGs and Slack channels, without meetups and KubeCons just like this one, there's a high chance that our journey would not have ended well, and we most likely would have ended up like these two guys. But luckily that didn't happen, and here we are at KubeCon telling the success story of Game of Thrones season seven on Kubernetes. Thank you.