Hi, my name is Liz Fong-Jones, and today I'm joined by Tamao to talk about how we can improve the environmental impact of our services, and how to make that take as little time as possible thanks to the flexibility of GitOps. So why should we care about this subject? Well, I think it's because of two reasons. Number one, our planet is on fire, and we would ideally like to have fewer flash floods and fires that destroy towns. And also, we would like to save a little bit of money for our companies along the way. All of this is happening in an environment where we're also being pressed to do more with fewer people on our teams and with fewer cloud resources. So it may seem like we're trying to have everything all at once, that we're trying to get both more work out of people and to decrease cost. That may seem impossible, but the answer is yes, you can actually have it all. Here's the recipe for how Honeycomb did this, and in a moment, Tamao will tell you how you can apply it in your own environment with a lot less work than we went through.

So the ARM64 processor architecture has been commercially available since the end of 2019, beginning of 2020, with the announcement of the Graviton2 processors in Amazon EC2. Since then, it's expanded to many more cloud providers; basically any major cloud provider supports it, whether it be Azure, GCP, Oracle, or Equinix. And many of you have macOS devices that are actually ARM under the hood. I was originally a little bit skeptical of this, because it was December of 2019 and Amazon did the song and dance up on stage and said, you know, this is going to be great, it's fantastic, it's going to save you money, but you might need to configure and change your software. And I thought, okay, if it says on the tin that it's going to be 30% better, that seems like it might be worth it, because it turns out that a large portion of Honeycomb's costs relate to our cloud bill.

But why isn't this just the default across the board? Why isn't everyone using ARM64? The answer is that it took us a while to get here. For a very long time there was a proliferation of architectures, through the 1980s and 1990s. How many of you remember using PowerPC Macs? Yeah. So it didn't used to be all x86-focused, but eventually x86 won in the early 2000s, everyone was using x86 as their default, and all the scale-out software and hardware was built with x86 in mind. What changed was two things. Number one, mobile phones came onto the scene, demonstrated that there was a need for more power-efficient compute, and showed that people were willing to rewrite how software was developed. And number two, we stopped depending on purely closed-source platforms, and we started to have the ability to compile our own software from the base level upwards to get working systems that we could build on.

So how ARM achieves these results compared to x86 comes down to efficiency. ARM is fundamentally more power-efficient, and therefore cheaper for cloud providers to operate, as well as having lower licensing costs. That's because it has a simple, reduced instruction set, as opposed to x86, which has a complex instruction set. If you've ever looked at the manual of x86 assembly instructions, it's a phone book. It's actually two different phone books stacked on top of each other.
And it turns out that in order to translate that into what the processor is doing under the hood, x86 needs a lot of its die dedicated to doing that translation, whereas an ARM processor can have a one-to-one correspondence between decoding instructions and executing them. So when you're buying one virtual hyperthread on x86, you're actually getting half of a real executing core, and when your system gets more than 50% saturated, your latency will really, really severely degrade on an x86 machine. By saying, we're going to throw out the rule book, we're going to go with a simple, reduced instruction set, and we're going to have a one-to-one correspondence of virtual CPUs to executing cores, ARM has been able to deliver better performance.

So Honeycomb has been an early adopter of this, and it has really enabled us to scale economically. I'm speaking to you in my capacity as field CTO at Honeycomb, and I had to insert a good fun little joke in there. So was it worth the risk? Fundamentally, Honeycomb's job is to transmute data collected using OpenTelemetry, which is a sister project under the Cloud Native Computing Foundation. Our customers send us millions of data points per second about what's going on in their systems, and our job is to ingest that data and make it queryable at runtime as quickly as possible so that users can debug what's happening inside their systems. So fundamentally we care about, yes, lowering costs, but at the same time we cannot break the system for our customers. That means our homepage has to load quickly, people have to get a quick at-a-glance view of what's going on in their system, and user-run queries have to be very fast. Faster than taking a sip of your coffee. Definitely not, you know, I'm waiting, I'm going to go make a fresh pot of coffee. So by adopting ARM64, our hope was to lower costs, improve performance, and make sure we continued to deliver great reliability.

So how did we actually do this? Well, we wanted to make sure we would be able to interoperate for a while, run both architectures in parallel, and eventually condense onto whichever one made the most sense. Spoiler alert: it was ARM. And we did this all while growing by a factor of 10, both in terms of data ingest and in terms of the number of queries being run against Honeycomb, over the trailing three-year period. So how did we adopt this while the car was moving and on the road? The answer is that we started in the lowest-possible-risk environment: for instance, internal dogfooding environments rather than full production environments. We also looked at our mixture of services and said, we're going to pick the services that are stateless rather than stateful, where we get a chance to do things like gradual drains and traffic shifting, things that give us a second chance rather than, oops, we corrupted the data, it's gone. Shepherd, for us, is that ingest service: it transmutes incoming OpenTelemetry-formatted data from our customers and persists it into Kafka. And the cool thing about it is that it's a real acid test of whether or not a compute architecture is up to snuff, because it's doing heavy compression and decompression workloads and parsing, all of those things that really exercise a CPU. So we wanted to make sure that we were going to run a good experiment.

And that meant that we needed to first build artifacts for both architectures. We wanted to have build artifacts, we wanted to have base images, and we needed to make sure that nothing was going to bite us in the rear end unexpectedly. Things like, oops, this is assembly-optimized only for x86, and the performance is 10 times worse if it's not x86. We wanted to look out for landmines like that. Fortunately, it was pretty easy for us, because Honeycomb uses Go. And the cool thing about Go is that you don't have to install a cross-compile toolchain; all you have to do is set a single environment variable, GOARCH, and you can cross-compile right on the spot, regardless of the host architecture. It's even easier for Java and Python, because you don't even need to compile your binary any differently; it's just one architecture-independent executable. Asterisk there for wheels and stuff, which we won't get into. But do not pick C++ as the first thing you try running on ARM.
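To make that concrete, here's a minimal sketch of what the "build for both architectures" step can look like. This is a hypothetical GitHub Actions matrix, not Honeycomb's actual pipeline; the module path and binary name (ingestd) are made up for illustration:

```yaml
# Hypothetical CI sketch: cross-compile one Go service for both
# architectures. Only GOARCH changes between matrix entries; pure Go
# code needs no cross-compile toolchain at all.
name: build-multiarch
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        goarch: [amd64, arm64]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - name: Build
        env:
          GOOS: linux
          GOARCH: ${{ matrix.goarch }}
          CGO_ENABLED: '0'   # avoid needing a C cross-compiler
        # "ingestd" is a stand-in name, not a real Honeycomb build target
        run: |
          mkdir -p bin
          go build -o bin/ingestd-linux-${GOARCH} ./cmd/ingestd
      - uses: actions/upload-artifact@v4
        with:
          name: ingestd-linux-${{ matrix.goarch }}
          path: bin/
```

The same pair of binaries can then feed a multi-arch container image; that image step is also where the Java and Python asterisks (native wheels, JNI libraries) would surface if you have them.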
So assuming you have artifacts built for both architectures, either as binary files or as Docker images, the next step is to go ahead and actually try running it in production. So that's what we did. We said, okay, we're going to try running one machine on ARM64, and then we're going to try running 20% of machines on ARM64. That first machine was actually pretty easy; it just took a couple of afternoons of reconfiguring our build system. But then things got a little bit hairy, because while we do practice GitOps at Honeycomb, at the time, in early 2020, when I was doing these experiments, we were not using Kubernetes at all. Everything was managed using Terraform Cloud, which meant that, yes, we were doing pull requests to update things, and then Terraform Cloud would synchronize the EC2 instances and auto-scaling groups. So when you went from one instance to suddenly needing to turn up 10% or 20% or 100% of instances, and then, oops, that didn't work, scale it back down, each one of those steps was waiting on a full Terraform read-the-state, synchronize-with-production, and apply cycle. It was very slow and a little bit painful. But we did get there. It just took a lot of effort.

And at the end of the day, we could use Honeycomb itself to measure and understand: what was the effect on latency? What was going on in production? Specifically, what we saw was that the latency on the ARM64 instances stayed absolutely rock-solid flat regardless of the load on the instances, whereas the x86 instances were seeing up to 10 or 20% degradation in performance as load increased throughout the day. So was that a sign that we were only getting about a 10 to 20% performance improvement rather than the 30% that was promised? No. That was a sign that we were not pushing it hard enough. Here's an example of the CPU utilization plot, showing that an x86 instance running at steady state will max out at about 85% utilization before latency starts to degrade, and that there's a significant difference between the maximum CPU utilization and the median. So there's only so far you can push those instances. For comparison (there we go), this is what ARM looks like.
The distribution of CPU utilization is much narrower, which means you can drive up the median CPU utilization without maxing out and causing requests to stall. So what we had to do to really thoroughly validate, test, and iterate using GitOps was to raise the auto-scaling targets, taking away instances until latency started to saturate, so that we could really measure the performance improvement. That's what we did. And at the end of the day, we were able to run something like 20 or 30% fewer instances, and each instance cost 10 or 20% fewer dollars per CPU-hour. I think that's pretty incredible, right? To be able to take a system originally running on only one architecture, trial a different architecture, scale it up and down from 0% to 10% to 100% and back, and then eventually turn it on in production.

So in June of 2020, Amazon announced general availability, and within a month or two we were able to fully cut over not just our dogfood environment but our production environment as well, seeing on average about 30 to 40% price-performance improvement. And then we did it for the rest of our services. We were able to deliver more query throughput, focusing not on minimum price but on ensuring we could run the same number of instances while scaling up the number of queries we were running by a factor of two or three over the past two years, without having to increase the number of worker nodes. And then there are the generational improvements: sure, you have Intel sixth-gen instances now replacing Intel fifth-gen instances, but the seventh-gen M7g and C7g Graviton instances are quite good. Compounding the roughly 30 to 40% improvement we saw going from C5 to C6g, we saw another 30% improvement going from C6g to C7g. This means the number of requests any one machine can handle has increased by a total of about 60% relative to fifth-generation Intel. The latency is dramatically reduced, and we're able to run things as hard as we can in order to keep costs low for our customers and make sure customers get results back as quickly as possible.

So having done all those experiments, we said, okay, we don't need the complexity of running both architectures at once; we can just turn off all of the x86 servers. Everything is ARM now at Honeycomb, with one exception, which is that sometimes you have capacity constraints, especially in AWS Lambda. So we've kept a feature flag to toggle the mix of Lambda executions, but it's set by default to 99% ARM64. That took a while to do, though. Sure, compiling it and running it on one machine took a few afternoons, but getting it running across all of production took several weeks to several months to roll out.
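For a sense of what pinning a Lambda function's architecture looks like, here's a minimal sketch in a hypothetical CloudFormation template. The function name, bucket, and role are placeholders, and the 99/1 split Liz mentions would live a level above this, in routing or a feature flag, not in the function definition itself:

```yaml
# Hypothetical CloudFormation sketch: a Go-on-Lambda function pinned
# to Graviton. Flipping Architectures (plus shipping an arm64 build
# of the binary) is the whole per-function migration.
Resources:
  RetrieverFunction:                  # placeholder name
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: retriever
      Runtime: provided.al2           # custom runtime for a Go binary
      Handler: bootstrap
      Architectures: [arm64]          # was [x86_64]
      Code:
        S3Bucket: example-artifacts   # placeholder bucket
        S3Key: retriever-arm64.zip
      Role: arn:aws:iam::123456789012:role/retriever-role  # placeholder
```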
So now I'd like to hand things over to Tamao to talk about how you might try one of these experiments and have it take not weeks, but maybe a day or two.

Cool, thank you. Such perfect timing. So I'll put my timer here. What we're really excited to show, here at GitOpsCon, is that with GitOps and Flux, this activity that, as Liz was saying, took many, many months is something you might actually be able to complete in a day if you have the correct setup. So we're really excited to show this, and we're sharing from people in the community who are starting this journey or are at different parts of it. So, some basics.

My name is Tamao Nakahara. I work at Weaveworks, the company that created GitOps, which was inspired by our key project, Flux. And we've gone on this great journey, because it has led to this type of event, GitOpsCon, and it's usually at KubeCon, where a lot of people are still new to Kubernetes even there. I'm just going to cover some basic things, and even for those of you who might know this, hopefully it'll add some nuances, especially within the context of ARM64. So I'll cover what GitOps is, how it works with Kubernetes, especially with Flux, why you should trust it, and how it will work for this.

GitOps is a term that our CEO, Alexis Richardson, coined in 2017. We had created Flux for our own needs as early adopters of Kubernetes, but we're an open-source-first company, so we had it out there. Around February of 2017, our CEO put out a blog post saying: I'm noticing that, as prevalent as Git is (Git, the open-source version control software), people are using it for perhaps this concept of ops. That was in February, and around May of that year I went to the first Helm Summit, and it was really amazing; everybody was just saying GitOps, GitOps, GitOps, as if the term had existed for many, many years. So I thought, wow, this is really catching on, and I think there's even a tweet like, oh, I had a really complex term for that; I should have thought of GitOps. It's like observability, right? I don't want to say it now. Exactly.

One of the things we often describe it as is operations by pull request. As Git was becoming really successful, maybe people thought about it more for app deployment, but as you'll see in our world, infrastructure teams are seeing its benefits for operations. And I'm really excited that Brendan Burns, our friend and co-creator of Kubernetes, in fact has a talk tomorrow about how Flux and GitOps are the natural evolution of Kubernetes. It's something we've talked about together with Brendan. So why is that? I'll explain some of the basics, because even people in our space sometimes get this a little bit confused. And to add a note, Liz was saying that there are different flavors of GitOps in terms of how they reconcile with Git, but, to quote Liz, Flux is responsive and faster, especially in this example. So we're excited to share it.

So here are the different parts. Kubernetes does the reconciling, right? It takes what's in the manifest, often a YAML file, as the single source of truth, as the state of what things should be, and it makes the change. What Flux does really well is keep an eye on the repo that has that YAML file, and when there's a change to it, it says: hey, Kubernetes, we need to start the GitOps process. So there are various aspects to it. Because of that, it's a pull-based model, and there are a lot of security and other benefits that come with a pull-based model. At the same time, since Flux is checking in every minute, we're taking lots of steps to find different ways to minimize the resource consumption. There are different efforts going on right now; for example, if you know that everybody in Europe is going to be gone for a week and there won't be any changes, are there ways we can account for that on development clusters? And definitely, for example, with ARM64, we're really excited about that.
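To make the "Flux keeps an eye on the repo" idea concrete, here's a minimal sketch of the two Flux objects involved. The repository URL, names, and paths are placeholders, not the demo's actual repo:

```yaml
# Minimal Flux sketch: a GitRepository that is polled for new commits,
# and a Kustomization that applies whatever manifests it finds there.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config               # placeholder name
  namespace: flux-system
spec:
  url: https://github.com/example/platform-config   # placeholder repo
  ref:
    branch: main
  interval: 1m          # the "checking in every minute" mentioned above
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: workloads
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./clusters/prod               # placeholder path
  interval: 10m
  prune: true           # remove cluster objects deleted from Git
```

Everything in the demo that follows, changing an architecture value, pushing a commit, watching pods move, rides on this reconciliation loop.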
So the key thing here is that Flux ships with multi-arch images (ARM64, ARMv7, and Intel/AMD64), which makes this a much easier process, as we'll show. Some basics, if you haven't heard of Flux. We shared a little at the keynote about how, as this thing we built for ourselves and put out in the open gained traction and was used by some pretty serious companies, we wanted to make it part of our key strategy to make it legitimate within a foundation. So our company was part of creating the Cloud Native Computing Foundation, which is in the Linux Foundation. Flux is a graduated project, which is the top level, if you don't know what that means. And these are just some of the logos of the kinds of companies that trust Flux, because it is engineered fantastically for scalability, security, and so on; and many other financial and telco companies that can't add their logos are deeply, deeply using Flux. What's really exciting is that you might already be using Flux without realizing it, because key cloud providers like Microsoft and Amazon have already chosen Flux as the GitOps tool they offer to their customers. There's a talk later, I think, about how GitLab will be developing more with Flux as their embedded GitOps tooling, and many, many more.

The company I work for, Weaveworks, has Weave GitOps. And what's really important about our open-source strategy is that sometimes you see open source that's been made weaker so that the commercial product looks better. In our case, Flux is completely full force, and a lot of users will testify to that. It's also extremely unopinionated, and that's why people come to companies like us or Microsoft or Amazon for the abstraction layer they need to do it safely. In our case, we have seven, eight years of Kubernetes experience baked into how people can use Flux. So if you want the open source, it is full, full force. And yeah, please check out the QR code for all the talks we have this week.

And I'll do a quick mention of progressive delivery. It's a fairly new umbrella term for things like canary, A/B, and blue-green deployments, so we don't have to keep repeating those; it's becoming a common term. Within Flux there's a sub-project that provides this; it doesn't require Flux, but it's optimized for Flux. The reason I bring it up is that Flagger, that sub-project, sits between Flux and Kubernetes, such that when Flux says, oh, I noticed there's a change in the manifest, Flagger can be there in between and say: okay, let's redirect traffic slowly, and we'll also take metrics from things like Prometheus or Datadog, whatever it is, to make sure the deployment is successful based on your thresholds. Like: okay, it looks like it's going well, another 10%, another 10%, until you complete it. And if it isn't successful, if something's awry, you can roll it back. So especially with Flux. Which would have made my life a lot easier, as opposed to manually switching over 10 or 20% of traffic at a time and doing one Git commit at a time: being able to say, this is where I want to get to, system, just run it yourself and roll back if there's a problem. Yes, so that's why lots of people and other speakers will be talking about how excited they are. We're really glad we have that as part of Flux.
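For a sense of what that looks like in practice, here is an illustrative Flagger Canary, not taken from the demo repo: the workload name, port, and thresholds are invented, and it assumes a mesh or ingress provider that Flagger supports for traffic shifting:

```yaml
# Illustrative Flagger Canary: shift traffic in 10% steps, watch a
# success-rate metric, and roll back automatically if it degrades.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo                  # hypothetical workload
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898                   # hypothetical service port
  analysis:
    interval: 1m                 # how often Flagger checks the metrics
    threshold: 5                 # failed checks tolerated before rollback
    maxWeight: 50                # promote once the canary reaches 50%
    stepWeight: 10               # the "another 10%, another 10%" above
    metrics:
      - name: request-success-rate   # Flagger built-in, backed by Prometheus
        thresholdRange:
          min: 99                # roll back below 99% success
        interval: 1m
```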
And so one of the questions I had as we were talking about this: would this be used only to migrate to ARM64, or are there cases where you'd move back and forth? And I think Liz was saying, yeah, maybe you have some compute loads (she mentioned Lambda) that you might want to move back. What makes GitOps so easy, as I'll show in the demo, is that it's a fairly quick and basic process, so you can have that kind of flexibility.

Cool, so with that I will go to the demo, with just a quick caveat here: this demo and this example do require that your CI is set up for multi-arch builds. So that is one prerequisite. But once you have that... and I'm going to move to a recording, because there are some delays and stuff and you don't want to be relying on internet, sorry, wifi internet. So I'm going to speak over it. Thanks to Kingdon in the room, who helped put this demo together; if you have any questions, find him. It's on EKS clusters. And the key thing we have here is that with GitOps, you have a fairly simple way to pin to the architecture you need. This screen shows that we have (sorry, it's going to build out here; I'm waiting for it) two instance types. One is m5.large and the other is m6g.large, and the m6g one is ARM64, the g standing for Graviton. Then there's a little bit of other stuff here, like auto-scaling if you need it. And we point out here in the demo the Git provider, which will show up right here, and which just installs Flux as you work through this process. Yes.

So now what we'll show is how you make this basic change within a YAML file. And a quick shout-out: Kingdon and another person on our team also built a GitOps extension that lets developers do very complex GitOps deployments directly from VS Code. If anybody's excited about that, come talk to us; it's one of the things that's been growing really quickly in downloads and in excitement. Okay, so here you can see that we have the image we highlighted in the beginning. We've got x86_64 here. So these are our two nodes. And then, to do this in a GitOps way, we go into the YAML file and make a quick change. So here we are: in this case, we have a kustomization.yaml file. There's a bunch of other stuff here, but the key thing (sorry, I was trying to focus on this particular part, as you can see at the bottom in the middle here) is the values, and you can see it says amd64. Now we simply change that value in the YAML file to arm64, and since Flux is listening to the repo, it will then start making the change.
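The demo repo's exact layout isn't shown, but the change boils down to something like the following: a workload pinned to the standard kubernetes.io/arch node label, where flipping one value and committing is the entire migration. A hedged reconstruction, using the podinfo demo app as a stand-in workload:

```yaml
# Sketch of the demo's one-line GitOps change: pin the workload to
# arm64 nodes via the standard kubernetes.io/arch node label, commit,
# and let Flux reconcile the cluster to match.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo                      # stand-in workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64    # was amd64; this is the whole edit
      containers:
        - name: podinfo
          # must be a multi-arch (or arm64) image for the pods to schedule
          image: ghcr.io/stefanprodan/podinfo:6.5.0
```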
Now, earlier I mentioned that we describe GitOps as operations by pull request. So obviously, in a proper environment, you'd be using something like GitHub or GitLab, where you have the pull request methodology, so you can have approvals, you can have audit logs, you can have all the good stuff. But here we're just doing it with a git push, which then makes the change. So with this, we'll now start to see the change happening; they are moving off of the AMD nodes. You'll see a little bit of a spike, and then they'll be moving onto ARM64. So here we're showing the pods as they make the change. And then eventually in this demo we'll be pulling data from Prometheus so that you can see those metrics in Grafana. So here you've got the workloads. Sorry, am I looking at the workloads?

Yes, so this will show up in a second. Among the workloads, one of them is Prometheus. So here you can see the shift and the GitOps process going through. I'm probably talking a lot faster than I did when I practiced. Okay, so here we have Prometheus in this particular case; that's the data we're capturing, and it will help us see the change in the data. Here we've skipped ahead, I think about 30 or 40 minutes. You can see on the right that memory usage was at a certain level, there's of course a spike during the GitOps shift, and after that the resource usage has gone down. So even with this tiny little demo, you can see the drop in resource usage. Obviously, Honeycomb has talked about their experience. And we're really excited: we have our friends at Zscaler, who are Flux users and who have been sharing with us (I think you said 20 to 25%) and who are in the process right now of making that shift. If anybody has questions for them, they'll be speaking later today, and I think a few of them will be here for the rest of the event. It's very workload-dependent, so try it for yourself and see what results you get. Yes, absolutely.

So I will pause here. I had another part of the demo in which you do this again using progressive delivery, with a canary deployment, but I think it would push us a little over time. Again, the concept is that Flux would alert Flagger and you would go through this whole process, but with that safety net, because you don't want to go from 0% to 100% like we just did; you probably want to add some guardrails. Yes. And I think Yvonne from RingCentral is speaking at GitOpsCon as well. They had spoken in the past and said that if you can get Flagger in there, it not only makes things safe, but because it makes the experience safe, it really accelerated their adoption of not only GitOps but Kubernetes. Because everybody's got their different teams and their developers, and everybody's got to learn, right? So there can be challenges, and the more the developers knew they could experiment safely, the more it accelerated their Kubernetes journey.

So with that, we'll go back to our slide. So that's me. Like I said, there's the QR code; weave.works is our company. Reach out about Flux if you have any questions. And you are also promoting something: yes, please also go check out the book we wrote at Honeycomb, called Observability Engineering, where we talk about a lot of the architectural decisions that went into how we modularized things and how we were able to migrate pieces onto more efficient architectures. If you search for Honeycomb Observability Engineering, it should pull up a link to download a free copy of the book. Great, awesome. So with that, I think we'll close. If anybody has questions (I think we're out of time), just come find us down here; it's lunch anyway. Yes, thank you. Thanks so much.