Hi, I'm Melanie, and I'm here to talk about What's Going ARM: adopting ARM64 at Airbnb. For the agenda, I'm going to first give an overview of ARM, then the pitfalls and challenges we faced, and finally how we designed multi-arch support at Airbnb.

So what is "going ARM"? ARM is a company that designs the ARM architecture and licenses it to other vendors, who then develop their own ARM-based processors. Many companies have worked with ARM, for example Qualcomm, Ampere, Amazon, NVIDIA, Apple, etc. I want to give a brief history of two architectures. We start with IA-32, which started out as a workstation and PC optimized architecture and then evolved into x86-64, which won the server market while competing architectures failed to capture it. ARM32, however, had a foothold where energy efficiency matters, for example mobile and embedded devices, but then ARM64 came out. Whereas ARM32 targeted efficiency, there is now a competitive ARM design that targets both energy efficiency and performance.

So who here has an Apple M1 laptop? Okay, great. These new advantages, combined with licensing terms, allow other vendors like Apple to switch out their Intel processor for their own custom ARM-based processor. If you're a developer that uses an M1 MacBook, I'm sure you've noticed that Apple switched completely from Intel-based processors to their own M1 design. So now developers are exposed to ARM64, since toolchains had to add developer support for Darwin on ARM. Like Apple, cloud vendors want to create their own processors: they similarly work with ARM to create custom system-on-chip designs and offer them to their customers. All major cloud providers now have custom ARM-based VMs, and generally they advertise energy efficiency and performance and are priced quite a bit cheaper than comparable instances. This encourages customers to adopt ARM support so they can run on cheaper hardware.

However, there are pitfalls and challenges with ARM. Firstly, the architecture you've historically been using and developing for is x86-64, which is not compatible with ARM64. In the past you may have switched between, say, AMD and Intel instances, but that's less challenging because the instruction set is pretty much the same. Additionally, for Apple this incompatibility was largely masked by their fancy Rosetta technology, which automatically detects x86-64 binaries and translates them to ARM64. That kind of catch-all technology doesn't really exist for the wide range of software we need to run on servers; we can't just run it all through a translator. The next challenge is that while ARM64 support is very common in open source software, it's also relatively new, so you may have to upgrade and update a lot of software for it to run, and run well, on ARM64. Another challenge is the high surface area: anything running on your servers is affected. That includes the base infrastructure, the build toolchain, applications, common functionality running on machines such as security and observability, and more. This makes the challenge a full-stack, cross-functional team effort that requires expertise, staffing, commitment, and coordination. A pitfall we ran into early on is that a lot of our migrations are treated as all-or-nothing migrations, but for ARM we don't actually need to scope the migration to every possible workload.
Most companies are running the migration this way, and in the end it's actually a benefit, because we can target support where it makes the most sense: we want workloads to run on the right hardware based on their price-performance needs. ARM makes sense for some, but maybe not for others. I included a diagram with price on the y-axis and performance on the x-axis, and on the right I have a few choices of hardware I could run my workload on. Each of these will land somewhere on the graph, and in the end I'll choose where to run the workload based on what I'm optimizing for and the results of the analysis. Just note that the actual price-performance comparison depends on many factors, such as the workload, the cloud vendor, and the hardware.

One other challenge is that this is also a performance problem. ARM64 is in catch-up mode compared to x86-64: Intel specifically has spent years and years putting optimizations in place across the software stack. It will take time for software to catch up with the same kinds of optimizations for ARM64, and in the meantime it may require sophisticated performance analysis, tooling, comparison, and tweaks to achieve competitive performance.

Finally, before I go further, a note on architecture naming. You'll see all of the following combinations, and probably more, to describe what is basically two architectures: x86-64, x64, AMD64, and Intel 64 versus AArch64 and ARM64. The reason is inconsistency in how the architectures are named and referenced in various parts of the stack, for example packaging versus Docker images. So I'll probably end up using them interchangeably, and that's why.

Now I'm going to give a deep dive into how we designed multi-arch at Airbnb. At a high level, the goal of multi-arch is that our workloads can run on the best hardware for their price-performance needs without developers being concerned with the underlying architecture. Note that, as you'll see in this talk, while the goal is for developers to not be concerned with the architecture, the platform engineers involved, such as myself, need to do a lot of heavy lifting to make that possible.

Designing multi-arch at Airbnb has closely resembled the hero's journey. First was the call to adventure: the community and our peers began to embrace ARM64 support several years ago, and we were pretty excited by what we were hearing and seeing. Then came the refusal of the call: our initial analysis was that this was not quite compelling enough yet, and we knew that for our infrastructure it would be a pretty heavy lift, so with the unclear benefit we took a pause on the project. Some time went by, and then we invested in supernatural aid, with performance expertise and analysis, and we pivoted to a small, focused team to pilot the effort. Now we are deep into the road of trials, with many challenges and significant progress toward multi-arch support. We're not done with our journey, but I figured we have enough to share, so that's what I'm doing.

Okay, so first we focused our efforts on one or two workloads that have a business need for better price performance, which really incentivized us to do what was best for the business. We formed a small pilot group with subject matter experts and we figured out prerequisites ahead of time. This is what the pilot group looks like.
We have a few subject matter experts across the stack, particularly in systems, Kubernetes, and performance, as well as a developer from the application side. This is the overall plan: first prerequisites, then infrastructure, then Kubernetes, and then running the application itself. Note that the performance analysis can and should be done independently of the other work; you don't need to stand up all of the ARM infrastructure just to kick off an ARM build and do some analysis. I'll talk a little more about that later.

So let's dive into the prerequisites, which are mostly in the form of upgrades and migrations. First the good news: nearly all software that we have looked at has ARM64 support. The bad news: you probably have to upgrade to get it. This includes the operating system and ARM64 packages for that OS, languages and runtimes, and open source software.

Let's start with the bottom layer, which is the operating system. Choosing the right OS for ARM64 depends on your unique context, for example your cloud providers, your current OS, which packages you need to run, etc. One pitfall: if you're running any OS that uses EPEL 7, note that Red Hat added AArch64 support in EPEL 8, so you likely need to upgrade to EPEL 8 to get that package support. This is a brief analysis I performed on OS package support for some of the popular OSs. Older OSs, such as Ubuntu 18.04, don't have robust support, and they'll go out of support soon anyway. Amazon Linux 2, like other OSs that use EPEL 7, doesn't have that third-party support, like I mentioned. The latest of these and other operating systems, for example Ubuntu 22.04 and Amazon Linux 2022 (which is based off of Fedora Linux), appear to have up-to-date packages and ARM64 support. Because ARM64 support is a big lift, we don't recommend switching your OS from one distribution to another; pretty much just pick the one you have and upgrade it. It should probably be fine.

Now that we've selected our OS, we need to be able to build, upload, and sign packages, for example internal packages or third-party packages you want to mirror. At Airbnb, we automate that process with CI/CD jobs. This is what publishing packages internally could look like; I have two examples, one for RPMs and the other for Debian packages. You store your packages in Git repos, and a Git push triggers a CI job, which builds and deploys each of these packages. Because these are architecture dependent, CI jobs are dispatched for each architecture to CI workers running on that architecture. The results are pushed to internal repositories in our package repository, which is stored in the cloud. You can easily support building your RPMs on all platforms, but some might take some work to actually build properly on ARM64. One way we did this was to add ARM64 support and automatically dual-build everything, but we excluded a few packages to start with using the ExcludeArch: aarch64 directive. Then we went back and added ARM64 support one by one for the packages where it was challenging. Debian packaging also supports architecture-agnostic packages, as well as multiple architectures in the Architecture field, so you should check whether you're missing arm64 in that list. Also, this has happened to me a few times: arm64 and amd64 look pretty similar, especially if you've been staring at them all day. So if you're scanning this list, make sure you actually see arm64; amd64 is the other one, and you probably need both.
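As a concrete illustration of that check, here's a minimal sketch of a CI guard you could add; the debian/control path and the decision to require both architectures are assumptions on my part, so adapt it to your packaging setup.

```bash
#!/usr/bin/env bash
# Minimal sketch: fail a CI job if a Debian package's Architecture field
# lists specific architectures but is missing arm64 or amd64.
# Assumes a conventional debian/control file; adjust the path for your layout.
set -euo pipefail

arch_line="$(grep -m1 '^Architecture:' debian/control)"

# "any" and "all" already cover every architecture, so nothing to check.
if echo "$arch_line" | grep -qwE 'any|all'; then
  echo "architecture-independent package, nothing to check"
  exit 0
fi

for arch in amd64 arm64; do
  if ! echo "$arch_line" | grep -qw "$arch"; then
    echo "debian/control is missing $arch: $arch_line" >&2
    exit 1
  fi
done
echo "both amd64 and arm64 are declared"
```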
Okay, so next you need to check the language software you're running and update it for ARM64 support, because that support is relatively new. Some language support is more of a challenge than others, for example commonly run older versions of Ruby and native Ruby gems, and commonly run older versions of Python and Python wheels. Here is my best-effort analysis of several common languages: the minimum officially supported version for ARM64 and the latest stable version. Note, I wrote minimum officially supported version, but there are definitely instructions online from folks who run ARM64 on older language versions that are technically out of support. I'm not necessarily recommending it; I just found it on the internet.

Finally, you need to update open source software. The real list is much longer than what I put here, but these are some examples of open source software you may need to update for ARM64 support. Istio, Envoy, and Kubernetes all added support pretty recently. So this can be challenging. Who here has done a Kubernetes upgrade or an Istio upgrade? Yeah, so if you have to do a few of those, that's something you need to schedule; it takes a while. For those who do those upgrades, who has trouble keeping up with the latest version? Yeah.

So with prerequisites done, we can move on to the infrastructure. For us, infrastructure includes infra as code, such as Chef and Terraform; build, test, and deploy support; missing tools, agents, and daemon binaries; and missing multi-arch container support. Let's start with infra as code. There was some work we needed to do just so we could launch an instance in our infrastructure, and we did it pretty iteratively: we tried to launch it and it failed (the AMI was wrong, certain scripts were wrong), we fixed things, and eventually it launched and ran in a healthy state. We needed to update our instance launcher logic, some of our provisioning scripts, our packages, and our machine images.

A note on the provisioning scripts, or writing multi-arch software in general: you may have previously hard-coded things like pulling amd64 binaries, and to update a script like that, you need logic that inspects the architecture and pulls the right binary for the architecture the script is running on. I use uname -m or arch for that. One fun gotcha, I don't know if it's actually fun, but if you have a script that runs on both developer laptops and servers, beware that these commands return different things. You know how I talked about naming not being standardized? On an M1 MacBook it'll return arm64 and on a Linux server it will return aarch64. I got hit by that.
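Here's a minimal sketch of what that detection logic can look like in a provisioning script; the tool name and artifact URL are hypothetical, and the important part is normalizing both spellings of the 64-bit ARM architecture.

```bash
#!/usr/bin/env bash
# Minimal sketch: detect the architecture and pull the matching binary.
# The tool name and artifact URL are hypothetical placeholders.
set -euo pipefail

case "$(uname -m)" in
  x86_64)          arch="amd64" ;;
  aarch64 | arm64) arch="arm64" ;;   # Linux servers report aarch64, M1 Macs report arm64
  *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

os="$(uname -s | tr '[:upper:]' '[:lower:]')"   # linux or darwin

curl -fsSL -o /usr/local/bin/some-tool \
  "https://artifacts.example.com/some-tool/${os}-${arch}/some-tool"
chmod +x /usr/local/bin/some-tool
```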
Okay, so next you need CI/CD support. Basically you need to be able to run CI/CD jobs on ARM64 machines, and you need to be able to write and dispatch ARM64 jobs. For the first part, being able to run on ARM64 machines: this is pretty easy if you use a CI vendor. Who uses a CI vendor? Okay, who runs and hosts their own CI? Okay, I feel bad for you; it's going to be a little harder when you try to do ARM. The reason is that the CI vendor, if they support ARM, probably did all the hard work of provisioning their CI workers and getting them running on ARM. If you host your own, you have to do that yourself. That's what I did: I updated our CI to work on ARM64. Once you have a willing and able ARM64 CI worker, you just need to update your jobs to be able to dispatch to it.

So, dispatching to different platforms. Usually it's configured via an option in the CI job; in our case we added a platform option. This isn't exactly what it looks like, just the basic idea: you specify a Linux ARM64 CI worker in the job and it gets dispatched to, you guessed it, a Linux ARM64 CI worker. From here we just need to determine which jobs actually need to run on multiple platforms, and this came up quite a bit for building common binaries, packages, and builds, which I'll talk about soon.

But while we're here, I wanted to talk about some more gotchas that I ran into with multi-arch builds. Problem number one: if you have two different architecture builds and build caching enabled, CI will try to save time by checking the cache to see if you've already built it. The problem is, if your cache is not actually encoding architecture information, you can upload an ARM64 build and then pull that ARM64 build down on an x86-64 worker. These architectures are not compatible, so it'll blow up. The same problem can happen with container images: if you have two different architecture image builds, it's the same cache issue but with the Docker image cache. So you need to add platform information to the cache, or disable the cache temporarily while you add the platform information and get it all working on your CI workers. Otherwise you could accidentally pull down an ARM64 image from the cache on an x86-64 machine; it'll run the first command and then it'll blow up. So you'll want to avoid those if you have build caching enabled.

Now that we have CI/CD support, we can start building on it. At Airbnb, we already have a toolchain that supports building artifacts, tagging them with build tag metadata, and later pulling them down based on that metadata. With this schema in place, we can update the toolchain to support multiple platforms: first we build a project for our target OS and arch, then we update the build tags to include the platform information for OS and arch, and finally we update the clients to pull the right platform artifact based on what they're running. Just a side note: in this example, to make it easy, I'm cross-compiling Golang code for a developer tool, but in general, don't cross-compile any important Golang binaries with C dependencies, as these could unexpectedly fail and crash at runtime. It has not happened to me, but someone told me it happened to them, and you don't want that.

Okay, so in the history of supporting multiple architectures, I found another neat scheme that's sort of the opposite of this, which is the idea of multi-architecture or universal binaries. This idea actually came from Apple. It's a format for executable files that supports multiple architectures, also known as a fat binary, and it's basically designed to store all of the architecture binaries and select the correct one based on the architecture the executable is running on. This pattern is pretty tempting and I did see it crop up as well. If you were already building binaries for MacBook laptops and for Intel servers, for example, you might have been building all the binaries and writing a wrapper script that selects and runs the right binary for the OS; with M1 and ARM64 servers you could update the script to add more platforms, and with ARM64 CI workers you could update it again. So now we have the full matrix of OS and architecture.
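Here's a minimal sketch of what such a wrapper script might look like, assuming the per-platform binaries are bundled next to it under a hypothetical mytool-<os>-<arch> naming convention.

```bash
#!/usr/bin/env bash
# Minimal sketch of a "universal binary" style wrapper: pick the bundled
# binary matching the current OS and architecture, then exec it.
# The mytool-<os>-<arch> naming convention is a hypothetical example.
set -euo pipefail

dir="$(cd "$(dirname "$0")" && pwd)"
os="$(uname -s | tr '[:upper:]' '[:lower:]')"    # linux or darwin

case "$(uname -m)" in
  x86_64)          arch="amd64" ;;
  aarch64 | arm64) arch="arm64" ;;
  *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

# Hand off to the real binary so it keeps our arguments and exit code.
exec "${dir}/mytool-${os}-${arch}" "$@"
```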
The upside of this approach is that it's relatively straightforward to write, and it doesn't require making changes to the client; the wrapper just introspects what it's running on and picks the right binary. The downside is that you are now storing quite a few binaries in your executable, which could have significant storage costs or other impacts on the client, and the indirection logic itself can be kind of brittle, since it needs to be updated every time a new platform is added. However, I think it's interesting to talk about these two approaches and the trade-offs of each, because they've both come up.

Cool, finally I want to talk specifically about containers and container images. Containers cannot run cross-platform, just like binaries, and you can make platform-specific images, but there's also a nice abstraction around multi-arch images that I'll dive into next. Who here has heard about Docker image manifests, like manifest lists? Yeah, okay. Basically, this is a way to encapsulate our cool-project image build and point to multiple images, one for each platform. In this case, our manifest list has two manifests, one for linux/amd64 and the other for linux/arm64. The entire manifest list, which references both images, is pushed to our container registry, and now we can pull our cool-project image, and thanks to the manifest metadata it'll pull down the correct image for the platform of the client. Nice.

This approach does require some work to set up if you're building your own multi-arch images, however. To actually do this internally, for each platform you dispatch a CI job that builds the platform-specific image on a platform-specific CI worker and pushes that intermediate image to the container registry with a unique platform tag. Then you have a CI job that depends on both of those jobs: it pulls down the platform-specific images, which were built on their platform-specific workers, bundles them together by creating a Docker manifest containing those images, and pushes that back up to the container registry.

This avoids the image cache behavior issue from before, but I noticed that if you're not careful, you can still mess this up. One thing I noticed is that because image tags are mutable and only point to one thing at a time, you can accidentally clobber or override a tag that you're pointing something at. In this example, I went through all that work to push my super cool multi-arch manifest to my cool-project SHA tag, and then later a rogue platform-specific build is pushed to the same tag. The behavior I observed, at least with Docker and ECR, is that the registry basically just removes the tag from the manifest, which is now untagged, and uploads the platform-specific build with the same tag. So later on, my client, which is perhaps expecting my fancy multi-arch manifest, actually pulls down the platform-specific image and blows up. What I was trying to accomplish was seamlessly replacing the previous logic with multi-arch manifests, but you need to be careful that you don't clobber one by accident. That's something I ran into.

A final note on container images, something I find kind of fascinating: if you inspect a machine's architecture with arch or uname, it's different from what I'll call the normalized architecture that containers use for platform metadata. So upstream code that I found, and likely your own image-building code, will need logic just like this that translates to the normalized architecture that Docker recognizes.
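Here's a minimal sketch of that whole flow, assuming a hypothetical registry, image name, and tag; the case statement at the top is the kind of translation to Docker's normalized architecture names I just mentioned.

```bash
#!/usr/bin/env bash
# Minimal sketch of the multi-arch image flow described above.
# The registry, image name, and tag are hypothetical placeholders.
set -euo pipefail

repo="registry.example.com/cool-project"
tag="abc123"

# Translate the machine architecture into the normalized name Docker uses.
case "$(uname -m)" in
  x86_64)          arch="amd64" ;;
  aarch64 | arm64) arch="arm64" ;;
esac

# Per-platform CI job: build natively on a worker of that architecture and
# push the intermediate image under a unique platform-specific tag.
docker build -t "${repo}:${tag}-${arch}" .
docker push "${repo}:${tag}-${arch}"

# Combining CI job (runs after both platform jobs succeed): create a manifest
# list referencing both images and push it under the plain tag.
docker manifest create "${repo}:${tag}" \
  "${repo}:${tag}-amd64" \
  "${repo}:${tag}-arm64"
docker manifest push "${repo}:${tag}"
```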
Okay, so with the base infrastructure work done, we can move on to Kubernetes. Actually, a question: who here runs their own Kubernetes? Okay, and who here uses a hosted Kubernetes solution? Okay, very interesting. So again, it'll be easier if you're using the hosted one and a little more challenging if you're running your own. For doing it internally, Kubernetes support involves updating the infra code, updating the binaries, and creating multi-arch images. It's pretty much all the same tools we've already developed, but for Kubernetes specifically.

The main thing I noticed is that, at least for us, we need to build a lot of multi-arch images, and I'll tell you why. A quick recap on a Kubernetes pod: who here actually uses sidecar containers? Okay, cool, a lot of you. Well, the downside of that is you think, oh, we just need to run Java on ARM, it's Java, so that should be easy, and then you look at your Kubernetes pod and, oh right, we have all these sidecar containers. For each of these, we need those containers to build on ARM as well and to upload a multi-arch manifest. And for those of you who aren't familiar with sidecars, they're just a way to run common logic, like observability and security, in its own container. Also on the node, we have a lot of pods running that do Kubernetes logic, like cluster scheduling, CNI, other third-party functionality, and internal functionality. These also all have to be migrated to multi-arch images.

For the Kubernetes cluster to support multi-arch as well, we need each node to pull in these multi-arch images, and we also need each cluster to support different instance types, including architectures, for every node. If you have a homogeneous Kubernetes cluster that only supports a hard-coded instance type, you're obviously going to need to update that to run on another instance type. So we did that. We also had to add ARM64 Kubernetes binaries, configuration, and machine images.

Testing out the new ARM64 nodes also requires a little bit of careful coordination. Who here is familiar with the NoSchedule taint you put on a node? Cool, yeah, so it's basically just that. You taint the node with NoSchedule, and that way you can launch and test on that node, and Kubernetes won't schedule a real workload onto it and cause chaos for your customers. So we do that, and then we can also tag the nodes and the auto-scaling groups with the right metadata, and that's sort of how we actually roll it out.
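Here's a minimal sketch of that node isolation, with a hypothetical node name and taint key; it's just the standard taint-and-label pattern rather than anything Airbnb-specific.

```bash
# Minimal sketch: keep real workloads off a new ARM64 node while testing it.
# The node name and taint key are hypothetical placeholders.

# Prevent regular pods from being scheduled onto the new node.
kubectl taint nodes ip-10-0-0-42.ec2.internal arm64-pilot=true:NoSchedule

# Label the node so test pods can target it with a nodeSelector.
kubectl label nodes ip-10-0-0-42.ec2.internal arm64-pilot=true

# Test pods then add a matching toleration plus that nodeSelector, while
# everything else keeps landing on the existing x86-64 nodes.
```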
Okay, so finally let's talk about the multi-arch application. Running a multi-arch application requires a reproducible ARM64 build for performance testing, which I mentioned at the very beginning, as well as updating the application itself, which could include library dependencies, to run on ARM64. We limited the scope of the application work by focusing on Java services, which works out quite nicely; applications written in other languages, for example C or Ruby, would require a heavier lift to run on ARM. We also added the multi-arch build job to the application and conducted performance testing and analysis.

Much of this application work is focused on achieving superior performance. We have a tight feedback loop with a performance engineer who's building, testing, and running analysis on an ARM64 build. They then share findings with a cross-functional group, which uses the feedback to update the build. Note that this work does not fully depend on the ARM64 infrastructure: we invested a little in a reproducible build so we didn't have to depend on all of that internal infrastructure being finished just to performance test, and that way we could parallelize these tracks of work. One example from early on: the performance engineer tries to build the application, but a library fails to build. The application engineer looks into it, upgrades the library for ARM64 support, and then the performance engineer can try again. Once it's building, the performance engineer runs an ARM64 image that a third party provides, and they share logs and analysis. The third party's infrastructure engineer narrows down the performance issue, updates their images, and provides new instructions. The performance work was pretty much just iterative like this.

Okay, so that's actually it. Here's a summary of how we're doing multi-arch at Airbnb. Like I said before, the actual work varies depending on whether you're using third parties versus running your own for a lot of the infrastructure. We talked about the prerequisites, the multi-arch infrastructure, the Kubernetes work, and then the actual application work. And here are the takeaways. ARM64 is in the cloud, it's in all the clouds, and it's cheaper; that's why we're doing the work. It's not an all-or-nothing migration, so we focused on a single workload: focus on building the capability, don't try to migrate everything. But you probably need to upgrade and update all the things, which you probably want to do anyway, in order to get ARM64 support. Build all the things multi-arch: the binaries, the images, the builds. And invest in performance tooling and analysis.

A quick shout-out to the folks at Airbnb who helped work on this very valuable team effort. If you want to join the team, go to airbnb.com/careers. You can also reach out to me on Twitter. That's all I have. Thank you.