Cool. Hi, everybody. My name is Julian Portillo. I'm from Relativity. It's a small enterprise company that is making a migration from VMs into Kubernetes, and we have run into a lot of footguns, and hopefully we can help you not hit as many feet and instead hit targets.

So we started to migrate to Windows containers in June of 2021. For the first nine months of this year, we've gone through 192 million Windows containers that are actual production workloads, 50 million that are canary workloads, testing to see if nodes are working, and we found 105,000 failed Windows nodes. So obviously, if you don't have an automated way to catch failures like that, you're going to have a very bad time. We'll help you figure out how you can catch failures like those.

The number one lesson to take away as you migrate to Kubernetes, in a Linux world or a Windows world, is that these things are cattle, not pets. When you start your migration, pods can be the cattle, and then as you get further along, eventually nodes can be cattle: shoot those things in the head, find new ones. Eventually you can get to the world where clusters can be moved around and swapped at will, and that's where you're in a very happy spot from a DevOps perspective.

So here's what we're going to go through. First, who are we, what is the team I work on, and what is Relativity? Then a quick look at what our first Kubernetes setup was. Why we started to use Windows containers and the general motivation for it, since it wasn't a super common path two years ago. And then the Windows container pain points we ran into, how you can avoid them, and some recent very positive changes, like literally this morning, for Windows containers that you'll probably be pretty excited about.

So who are we? I think in engineering, especially platform and DevOps work, it's important to remember that it's a team sport. You have to make sure that everybody on the team enjoys working together and that there's no one individual who knows everything about everything you're working on. On our team, we have about 30 really smart, motivated, happy platform engineers who work on making sure that all of Relativity works well. We support about 500 different application engineers, and we have our own music video, we've got team logos, we've got movie trailers for demos, and we do pretty fun game days where we see how to break things in production and then how to come back from the dead when that happens. Our legal team said I could not share our music video, team logos, or movie trailers due to copyright infringement. I'm not really sure what that means. So instead I used Stable Diffusion to generate really terrible clip art that goes along with this. So enjoy.

So what is Relativity? If you all remember Tanenbaum's statement about never underestimating the bandwidth of a station wagon full of cassette tapes going down the highway, Relativity is attempting to solve the same thing for 747s full of documents. When two companies sue each other, you go through a discovery process. There's tons and tons and tons of paperwork. In the old days, like the 80s, they would fill up 747s with boxes of documents and then send armies of lawyers into the hot, hot sun to dig through those documents, write things down with pen and paper, and then send faxes and print things and all sorts of nasty stuff.
Relativity, about 20 years ago, started to see that that was not really the right way to handle this, and we started moving into the e-discovery space. So around 199 of the top 200 law firms now use our platform for moving around all of their documents for e-discovery, and we're pretty ubiquitous inside the legal space. If you're not a legal nerd, you've probably never heard of us, but we handle a lot of documents and do a lot of NLP on those documents.

How do we use Kubernetes now? We have Kubernetes clusters in 20 regions spread around the world. We have to have them in certain regions for data protection laws; it turns out that you can't really move EU data into the US without getting a lot of regulators very upset. US stuff you can move around, but that's another story. We have an orchestration system that can handle millions of containers per day. We have automated vulnerability patching and dynamic image changes. Just on Tuesday we actually updated the OSs we were running on in production, during business hours, without any customer interruption. We can do some pretty cool stuff. We run hybrid Linux and Windows clusters, so we run Linux and Windows workloads at the same time.

How we started: when we began our Kubernetes journey, we basically had a lot of VMs doing machine learning and NLP things, where they would run for a very long time. They needed a lot of CPU for a little while, and then they would just kind of sit there and waste money for us. So our very first step was moving some machine learning and Linux workloads into Kubernetes, where we could scale up and down at will. We got all the really nice things about having a neat, scalable, multi-tenant, highly available architecture with full CI/CD to production. When we started down this road, it would take a Relativity developer six months to a year for their code to go from off their machine into production, which is a terrible experience. Nobody likes that. I would cry if I dealt with that on a day-to-day basis. I cry thinking about it sometimes too, just for fun. Now we have full CI/CD to production. For one of our demos of vulnerability scanning and patching, we ordered a pizza, deep dish, and tried to race the deploy of our patched images all the way through our production clusters before the pizza was delivered. We cannot beat thin crust, because we have safety checks as we go through our various rings of production, but we can beat deep dish. So someday we'll work on getting better at that. Also, after the talk, we can talk about which kind of pizza is the best: Detroit or Chicago or New York. I think it's Detroit. I'll say it. I'm here. New York's good too. So Chicago.

So a large portion of the workloads at Relativity were on Windows. We were a .NET shop. We started out doing, you know, Microsoft Word documents, PDFs, and just moving things around that way. So the general gist is we had a lot of Windows VMs that saw how well our Linux workloads were working and said, oh, that's great. I want to deploy all the way to production with CI/CD. I want to be able to get my code out there and into my users' hands very quickly. That should be really easy, right? Sure. It is really easy if you think about how you're going to do the migration, and if you take the time to pay back tech debt and to think about how you're going to adjust your architecture as you move from VMs to the Kubernetes world.
We have some groups at Relativity that did that, and they've done very, very well using Windows containers. They've migrated and had very few issues. That's not fun to talk about, though. The really fun stuff is where we've run into a lot of fun issues, like 105,000 broken Windows nodes.

So, as Tolstoy, or Liz Fong-Jones from Google SRE, said: happy Kubernetes migrations are all alike. You take your time, you think about what you're going to do, you become an expert on what you're going to work on in dev, you scale up in dev, make sure you understand how things are going to work, you pay back all your enterprise tech debt, and you go make the argument to all of your business and development leaders saying, hey, we need to do this or else we're going to run into problems. Unhappy migrations are each peculiar in their own way, because there are going to be certain things where people say, hey, I can't change this, this is not a thing that can change for the cloud native world, and they don't realize that if you don't pay back your tech debt, your customers will. You can't just say, I'm going to make this migration, I'm going to change nothing, and everything's going to work really well, because you're going to run into problems.

Obviously, what we did was try to just power through it. We had a set of systems that was running a ton of processes. Basically, whenever a user clicked a button inside Relativity's UI, it would spin off a process and do some work, or you could schedule a cron and it would go do some work on different things inside a set of VMs. Scaling up those VMs was a manual process for our ops teams that took up to 10 hours: 10 hours of ops teams actually working on them. There was a fair amount of tech debt inside our pipelines to do those scale-ups. So now everyone is very excited about being able to scale up at the push of a button and move things forward that way.

So we went in with some very happy-path ideas. We were just going to take each of those processes running on our VMs, package them up into containers like you would do on Linux, and then run them on Windows nodes. For people who have worked with Windows containers already, you'll notice that there are a lot of very interesting problems there. The average Windows 2019 container, as of two years ago, started at three gigs. When you actually add in all the stuff you need, it goes up to six gigs, or 15 if you aren't very careful. On Linux, you can run sub-100-megabyte, even sub-15-megabyte processes that do everything you need. Distroless containers are really great. You can just run things very quickly, with an average life of 30 seconds, and be off to the races. When you try to do that with Windows, you will have relatively long pull times. If you're not careful with how you're pulling onto the node when you're starting these things, you can also get into some very nasty situations where nodes will just fall over when you're pulling, like, 40 pods at once that all have 10 gigabytes of data.

So, as you can guess, our initial results ended up in the sad cloud. We had some common networking failures. We had some common container start issues. There was a huge lack of visibility, and there was a huge lack of common open-source tools. Like, node-problem-detector is what everybody uses on Linux, right? You can identify when your node has fallen over. There are lots of cool statuses.
If you find a new node status, like some "Plague" status from five years ago, somebody has probably already noticed it and added a patch to node-problem-detector so you can catch it. That did not exist for Windows containers two years ago. We couldn't run HostProcess containers or privileged containers at the time, so you could not get access to the host without setting up a bastion server and then RDPing into the host, which is a real pain, nobody wants to do that, and no security team will let you do it in production, which is where you generally run into issues. If you can catch scaling issues in your non-production environments, you can stop your promotion to production. If you can only catch them in production and you can't RDP into your node in production, you are in a very sad spot. So you have to build better and better scaling tests in your non-production environment, and you get all sorts of fun things there.

So, no privileged containers, and no metrics either. That has recently changed with HostProcess containers and the Windows node exporter. There were no easy logs, and we couldn't do easy scanning and defense against vulnerabilities. We had to make some interesting changes in terms of how we were scanning nodes, or scanning containers. With Linux, if you use an off-the-shelf container scanner, you can generally do your scanning on your production nodes; if you have a large enough pool of nodes, you won't run into huge performance issues. We tried to do that with Windows and ran into massive issues, so we started doing that kind of scanning in our non-production environment instead, and then tracking the SHAs of what we were actually deploying to keep track of whether there were vulnerabilities. And then there were the general debugging nightmares we had.

So, our first solution to get around these: you can't solve problems if you can't measure them. We started to run code on the host. We set up code to punt logs out to blob storage when we saw there were issues, to identify what we were dealing with. We set up things where we could punt out the metrics we cared about from the containers we cared about, and we made sure that all of our teams had really solid logging from their containers. One really key thing that let us do that is that we have a base image for all of our agent teams, all the developer teams building these Windows container processes, and we can easily update everything they're running for all of our user teams from a centralized location. So if we find a vulnerability, our orchestrator plus that imaging service can fix it, or if we find a bug with how we're pushing out logging, or we want to change the amount of logs we're getting, we can easily adjust that, even for certain regions. So, even more than in the Linux world, make sure you have your users all inheriting from a single base image that your platform team can control, unless you have very, very solid development teams that really know what they're doing and want to go their own way, and even then be careful.

And then we used Kubernetes events and the API to get as much visibility as possible. What we learned was that on a fairly significant number of Windows pod start-ups we were getting different Windows node failures. Sometimes those would clear up, sometimes they would not.
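To make the "watch the events" part concrete, here is a minimal sketch of an event watcher using the Python Kubernetes client. It is only an illustration, not Relativity's actual tooling: the failure signatures are hypothetical placeholders, and the real ones would come from what you actually see breaking in your own clusters (the HCS shim errors described next are a good example).

```python
# Minimal sketch: stream cluster events and flag Windows pod-start failures.
# The signature list below is a hypothetical placeholder; real signatures come
# from whatever you actually see breaking in your clusters.
from kubernetes import client, config, watch

config.load_kube_config()  # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

FAILURE_SIGNATURES = [
    "hcsshim",                   # HCS shim errors on container create
    "failed to create sandbox",  # sandbox / HNS setup failures
]

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces):
    event = item["object"]
    message = (event.message or "").lower()
    if event.type == "Warning" and any(sig in message for sig in FAILURE_SIGNATURES):
        node = event.source.host if event.source else "unknown"
        print(f"suspect failure on node {node}: {event.reason}: {event.message}")
        # A next step could be to cordon the node here and see whether the
        # failures clear up, which is roughly the "first shot" described below.
```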
We were seeing a large number of HCS shim errors, and as for what that is, this is a beginner talk, so we're not going to go into huge detail. If anyone wants to go into detail, though, I'm going to grab lunch after this, so feel free, we can go talk and commiserate. But anyway, HCS shim errors were the bane of our existence for a little while. They would cause random pod failures. We caught them via those events, and then we found that the node generally did not recover after those HCS shim errors. We actually worked with some Microsoft folks, thanks guys, they're over there. We shared some testing we had that showed this, and they started to patch a fair number of those errors.

So, first shot, Windows Proctor: we looked at all the events, and then we tried to cordon. That actually worked really well. We were able to catch things, and it was very fast too. We could easily catch an event and cordon a node, and if we saw that event go away, then we could let that node back into the friendly pool. The problem with it, though, was that there were even more variations of failures, and that approach did not work very well for us. There were tons of ways we got failures that we just weren't ready for. As an example, one of the failures we're going to talk about today is Host Networking Service failures, HNS failures. And, with a live demo, we're going to show how this actually had a pretty huge change in the last five hours in four regions in AKS, which is kind of cool. We also ran into new failures only at scale, so this was not a great way to go about trying to catch things. We had to continually update our event logger, we had to continually update what we were looking for, and it was just not a very fun thing. And as you're scheduling tens of thousands of containers in 20 minutes, you tend to put a fair amount of stress on the control plane, so you don't want to have a lot of different things looking at all the events coming off of your cluster. Yes, you can send them to another spot and watch from a centralized location, but we weren't really set up for that at the time.

So, our second version, which is actually what's been in use as we've scaled up our Windows container usage, is a really simple solution that has worked very well, which is my favorite kind of solution. We've had to do only very minor code updates to it as we've found vulnerabilities in dependencies. All we do is run a pod scheduler on a cron that schedules a canary to every node that comes up, and if the canary fails, we cordon the node. If it passes, we wait some period of time, schedule a pod again, see if that pod can connect to the variety of things we know it needs to connect to, and/or check a little bit of performance. If we notice really bad performance, we also get rid of the node. If the canaries keep failing, we drain the node; if they pass, we uncordon the node. Super easy, super simple, and it's worked really well.

This is what our actual growth pattern of Windows containers running customer workloads has looked like. In June of 2021, people were kind of excited. We tried to test it out. That's around 40,000 for our first month. We started to notice issues. We cut back and switched back to our non-Kubernetes VMs to run our agent workloads.
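As a rough illustration of that canary-and-cordon loop (the second version described above), here is a minimal sketch with the Python Kubernetes client. The canary image, namespace, timeout, and the checks the canary performs are placeholder assumptions; a real canary would hit whatever your workloads actually depend on, and the drain-after-repeated-failures step is only noted in a comment.

```python
# Minimal sketch of the canary approach: pin a canary pod to each Windows node,
# cordon nodes whose canary fails, uncordon nodes whose canary passes again.
# Image, namespace, and timeout are hypothetical placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def run_canary(node_name: str, namespace: str = "node-canaries") -> bool:
    """Run one canary pod pinned to node_name and report whether it succeeded."""
    pod_name = f"canary-{node_name}-{int(time.time())}"
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=pod_name),
        spec=client.V1PodSpec(
            node_name=node_name,          # pin the canary to the node under test
            restart_policy="Never",
            containers=[client.V1Container(
                name="canary",
                # Placeholder image: it should try to reach the things your real
                # workloads need and exit non-zero if anything fails or is slow.
                image="myregistry.example.com/windows-canary:latest",
            )],
        ),
    )
    v1.create_namespaced_pod(namespace=namespace, body=pod)
    deadline, phase = time.time() + 300, "Pending"
    while time.time() < deadline and phase not in ("Succeeded", "Failed"):
        time.sleep(10)
        phase = v1.read_namespaced_pod(pod_name, namespace).status.phase
    v1.delete_namespaced_pod(pod_name, namespace)  # clean up the canary pod
    return phase == "Succeeded"                    # a hung canary counts as a failure

def set_cordon(node_name: str, cordoned: bool) -> None:
    v1.patch_node(node_name, {"spec": {"unschedulable": cordoned}})

for node in v1.list_node(label_selector="kubernetes.io/os=windows").items:
    name = node.metadata.name
    if not run_canary(name):
        set_cordon(name, True)    # quarantine; drain the node if this keeps failing
    elif node.spec.unschedulable:
        set_cordon(name, False)   # node looks healthy again, let it back in the pool
```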
As a platform team, we started to look at what issues we were having and tried to start fixing them, working with Microsoft and working on our own to try to solve these problems. And then we restarted the migration in August. Basically, for the last two years, we have had 8,000% year-over-year growth, which sounds cool but is not a good thing. We don't actually get paid per Windows container that we run. I've tried to make an argument for that to our customers, but nobody has taken me up on it. As you'll notice, we've actually dropped in the past month. We've been switching a lot of our deployments from this really fast scheduling of containers to a multi-tenant shared container that runs the workloads instead. We can get the same amount of customer work done, and we do a lot less platform work. It makes life a lot better for our user teams, and it makes life a lot better for our platform teams. It's a little bit less fun, because it's kind of fun to think about whether we can get to a billion containers. But if I ever come up here and give a talk on scheduling a billion Linux or Windows containers in a month that isn't just a performance check, I will not be happy. And nobody will be.

So this is, you know, the 105,000 Windows node failures we were talking about. It was sort of scary. The general gist here, though, is that life has gotten a lot better. Switching to 2022 instead of 2019 in one of our test environments caused our failures per day in that environment to drop from 22.3 to only 2.6, which is pretty huge. There was no actual change there other than switching between the OSs. The code we were running and the number of containers we were running were exactly the same, all the systems were the same, and our stress test was the same. We're running on containerd for both 2019 and 2022. And once we added this HNS patch, which we're going to do a live demo of in a moment, we cut down to 0.14 failures per day in that test environment, which is pretty cool.

Starting around June, I think a lot of other people inside the Windows container community started to notice these HNS failures and tried to debug where they were coming from. We had also been hit by this in a major way. The mitigations we had in place were starting to fall apart a little at scale. Turns out that when you're scheduling like 30 million containers inside a single cluster, it's not fun to deal with those failures. So we started to dig more deeply into what was actually causing these failures and what was happening on the nodes when we had them. And because we had HostProcess containers available, we could poke around more easily than RDPing in, and we could write things that would capture logs, send them to blob storage, and then debug what was happening. We noticed that on 9 out of 10 of our node failures, at the time we started grabbing these failures, HNS was crashing. And for anybody who's looked into this: on 2019, if you try to restart HNS and kube-proxy, it just won't work well. It will take a very, very long time, like 110 minutes on a decent-size cluster. On 2022, it'll come back relatively quickly. But there are also other issues if you're not careful and rely on just restarting HNS and kube-proxy. If you don't also check for failures of certain pods once everything is back, you can have pods stuck in Running with no actual working load balancer. So, side note: build a check for that with your pods. Anyway, we found that out.
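That "pods stuck in Running with no working load balancer" check is worth spelling out. Here is a minimal sketch of one way to do it, assuming the checker runs inside the cluster and your services answer plain HTTP on a health path; the namespace and health path are hypothetical placeholders, and what you do with an unreachable service (recreate the pods, drain the node) is up to you.

```python
# Minimal sketch: after an HNS / kube-proxy restart, verify that each Service's
# ClusterIP actually answers instead of trusting that the backing pods show
# Running. Namespace and health path are hypothetical placeholders; this assumes
# it runs inside the cluster, where ClusterIPs are reachable.
import urllib.request
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def reachable(cluster_ip: str, port: int, path: str = "/healthz") -> bool:
    try:
        with urllib.request.urlopen(f"http://{cluster_ip}:{port}{path}", timeout=5) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

for svc in v1.list_namespaced_service("agents").items:
    ip = svc.spec.cluster_ip
    if not ip or ip == "None":          # skip headless services
        continue
    for port in svc.spec.ports or []:
        if not reachable(ip, port.port):
            # The backing pods may still show Running even though the HNS load
            # balancer behind this VIP is gone; flag it so something recreates
            # the pods or drains the node.
            print(f"service {svc.metadata.name}:{port.port} unreachable")
```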
We talked to our friends at Microsoft and said, hey, we're noticing this a lot. We noticed the segfault when we scheduled a ton of pods, and we built a little test script that schedules pods even more pathologically than our production use case and can recreate this failure pretty much at will. So the engineers at Microsoft started to figure out where that segfault was in HNS, and then they gave us an unsigned binary to test in our regression environments. We set up some ways to test that, and that's how we got to this HNS patch. Actually, these numbers, the 0.14, are from the update that got put out; we grabbed the signed binary instead of the unsigned binary and tested it out on our own.

The other very important thing, if you have not made the move yet, is to switch to containerd. In the exact same production cluster where we were seeing 436 node failures per day on average, which is not a fun experience (make sure you automate catching these things), we only had 82 once we switched to containerd. And then, once we were on containerd, comparing 2019 and 2022, we've caught 37 in our production environment. You'll notice this is different from our test environment. Our test environment showed, you know, a 90% drop; this was only about 50%. Production is different from test. You can try to make your non-production environments as similar to production as you can, but you will still run into issues, is what I've found.

Okay, and then here's the live demo portion. This morning, like 7:30 Detroit time, I got a note from some Microsoft devs in Shanghai that they had released the AKS build with the HNS patch to the regions we asked them to release to. Right before I started this talk, I went and checked what it looked like: we had five failures on our 2019 nodes, which is not fun, and zero failures on our 2022 nodes. So my friend Mike is running a stress test right now, as we speak, on our 2022 nodes, so we can see how this has worked. Let's see. This is our fun Slack channel where we report all of our non-production canary things. We also have metrics that we use and stuff, but I like seeing things in Slack because it's fun. Looks like we have some more 2019 failures since we started this talk, but zero 2022 failures. So that's pretty cool. Yay.

So I would strongly recommend, if you are running Windows containers in production, to try to upgrade as quickly as possible in the regions where it's released. Right now it's only in UK South and Canada Central, so you can go across the river over there, say hello, and say enjoy your stable Windows containers, and also in Central US and East US. It'll roll out to the other regions over the next two or three weeks too; I'll update the slide with the exact GitHub commit to follow. So make sure you update, make sure you grab that. You'll be a lot happier.

Okay. So, a tale of two designs. Things have been really sad, really scary, but only for some subset of our developers. The developers who built long-lived Windows containers that were just doing workloads, their stuff just got the job done. They had to handle failures from nodes and move over to non-busted nodes. We eventually set up some node pools for relatively long-running, long-lived containers that let them live in a happier environment for a while.
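A stress test like the ones mentioned in this section, the "little test script" and the one Mike is running during the demo, is basically a pod-churn generator. Here is a minimal sketch of that idea with the Python Kubernetes client; it is not the actual script, and the burst size, round count, namespace, and image are placeholder assumptions you would tune well past your real production pattern.

```python
# Minimal sketch of a pod-churn stress test: burst short-lived Windows pods to
# put create/delete pressure on HNS and containerd. Burst size, rounds,
# namespace, and image are hypothetical placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "hns-stress"
BURST = 200    # pods created per round
ROUNDS = 50    # number of rounds

def churn_pod(name: str) -> client.V1Pod:
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "hns-stress"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            node_selector={"kubernetes.io/os": "windows"},
            containers=[client.V1Container(
                name="work",
                image="mcr.microsoft.com/windows/servercore:ltsc2022",
                # Simulate a ~30-second agent: start, wait, exit.
                command=["cmd", "/c", "ping", "-n", "30", "localhost"],
            )],
        ),
    )

for r in range(ROUNDS):
    for i in range(BURST):
        v1.create_namespaced_pod(NAMESPACE, churn_pod(f"stress-{r}-{i}"))
    time.sleep(60)
    # Delete the previous burst so the churn (create + delete) keeps going.
    v1.delete_collection_namespaced_pod(NAMESPACE, label_selector="app=hns-stress")
```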
So for those long-lived containers, there are some real pros: you get the job done, your customers are happy, they give you more money. There's a huge con, though. You have a lot more free time, and then your PMs keep asking you to put out features instead of, you know, fighting fires. And that is, you know, a downside. 30-second Windows containers, however: from a platform perspective, there are a lot of really fun problems to solve, and you have to solve them very quickly as you get more and more containers thrown at you. So we've had some pretty interesting times solving these problems at scale as they come up, and big numbers are really fun. I like to see if we can get another comma put onto things sometimes. The con is obviously lots of problems to solve, and long startup times in the critical path are one of my least happy things in the world. I started my career in high-frequency trading, and we had to get all of our code out the door in under eight microseconds. So the fact that it takes us like four minutes to pull a Windows container really made me very sad.

Takeaways: please upgrade to containerd and grab this HNS patch as soon as possible. If you're using 2022, it's going to be released very soon. For 2019, it's not going to be out till November. You can go talk to the Microsoft folks and ask them to move things faster. I think they all have nice name badges, so hunt them down. They're very friendly. HostProcess containers, if you're not using them yet, make cluster configuration almost like Linux, and it's very nice. I'm pretty excited to see what other companies start building, and whether I can convince some people at Relativity to let us start open sourcing some of the tools we're building instead of just stealing things from the open source community. I'm just kidding. We are actually actively working on making sure that our legal team lets us open source some things. And also, as you're making a Kubernetes migration, whether it's Windows containers or Linux containers, the best way to solve a problem is to not have the problem in the first place. Work around the problem so you don't have to schedule 30 million Windows containers; instead, try to figure out how to schedule three and get the same amount of work done. Yeah. So I'm actually way early on time, but that means there's lots of time for questions if you want to start asking them. Any questions? Okay.

[Audience member] So I've actually run into a similar problem with short-lived containers, even on Linux. There's this anti-pattern I see a lot in legacy production: the "cron server." The server runs, it has a whole bunch of cron jobs in its crontab, it runs all the different cron jobs in parallel, you don't really set any CPU limits, and maybe it works. But when you migrate that into Kube, what I've noticed is, okay, of course the system is a lot more distributed and sane, but there's a lot more overhead for a Kubernetes CronJob compared to having, like, 70 crons on a Linux server. So do you think there are any solutions, either potential or currently existing, that could bridge that gap, that massive increase in overhead you get when migrating cron jobs to Kubernetes, especially short-lived ones?

[Julian] So I think the problem there is that there are some solutions that solve like 80% of those problems, right?
But 100% of the problem is in that last 20%. There's nothing that I know of that will make it so you can just take exactly what's running in a VM, transport it to the cloud without making any changes, and have things work perfectly. You're right, though, that is a thing you run into in a major way, and I've seen it happen at two different companies I've worked at. It comes down to figuring out how to make changes as you make the migration. Do some performance testing on your workloads and see where you're actually getting caught, whether it's container startup time, whether you're getting IO blocked, or whatever kind of issues you're running into, and try both ways. The cool thing about Kubernetes and the cloud native world is that you can move really fast and test your assumptions, so do it. Get some data.

Oh, I agree with that. For anybody who's listening virtually, he was saying that if it's running every minute, you're going to run into these pains, and I think you're right. The only way to change that is to change the system design so you don't run into it. I didn't go into detail on what our agent framework looks like, but basically we have a ton of virtual machines running Relativity software that our customers use, and every minute, yes, every minute, we have this tool per region that goes around and checks how many containers we need to start to get various workloads done. It checks workload discovery endpoints for all of our various microservices. Our first shot was to just start up everything that was asked for, all the time, and have throttling only in the platform layer, and that does not work. You also have to figure out how to make changes to that cron-job portion.

[Host] Thank you, Julian, thanks for coming.

[Julian] Yeah, thank you, everybody, for coming, I really appreciate it.