Hi, it's a big room, big crowd. I think some of you will find that some of the text in my slides is just a little tiny. You can always grab the slides from the app and review them later. All right, so I'm Mario. I'm an engineering manager at Meta, and I will talk to you about our data center accelerators today.

So a little bit more about who I am and what my team does. Like the introduction said, we're part of the firmware team, and with our many partners we work closely to deliver the data center accelerators for Meta. Being a firmware team, our primary objective, as you would expect, is to deliver firmware that ships on time, works well, and meets the expectations of all of our many partners. This may seem somewhat obvious, but we have many, many partners, and they have many expectations. Don't tell them I said that. My team's focus specifically is on the testability and maintainability of the software, and on the developer experience of those same firmware engineers. Some of the biggest challenges in that space come from having to smoothly integrate what you all think of as best practices in both ASIC design and firmware development. A lot of that is obvious to you. But matching it with existing Meta infrastructure, existing software infrastructure, and existing engineering culture is not always obvious or easy, so that's where we focus a lot of our effort. What we do is serious, but we don't always take ourselves seriously, and in that spirit I included a picture of two pineapples in space. It's an inside joke, but I also put it up there to test my legal department and see what they would say about the copyright, because it was AI generated and I was curious. They didn't say anything, so it's still in here.

All right, jumping straight into what we actually built. I can talk about two accelerators here. On the left is MTIA, the Meta Training and Inference Accelerator, and then there's another one we built called a Scalable Video Processor. I have to read these names, because those were not the code names we used while working on them at all, but they're the names of the accelerators now. MTIA is our first generation AI accelerator, designed for our internal workloads. It's an inference accelerator; I know "training" is in the name, but the first generation focuses on inference. It was fully co-designed, a full stack solution all the way from the silicon up to PyTorch and the recommendation models that we run on it. As you would expect, AI workloads are fairly ubiquitous at Meta at this point, and this is the accelerator that's meant to work well with the workloads we run. We strongly believe we can get good performance out of this thing, even better performance per watt, and that we can really crush it on performance per dollar. That's where we're heading with this. On the video processor side, this was actually the first in-house data center accelerator we did, and it's focused on video transcoding. There's a lot of video that we serve, whether it's video on demand, live streaming, or short form video, and as we serve those videos we try to match users' connection speeds.
And so while we're matching the connection speed, we have to transcode the videos on the fly, and that can be done much more efficiently, much more energy efficiently, using the accelerator.

All right, jumping straight into the architecture of MTIA, the inference accelerator. All the hard work is really done by this eight by eight grid of processing elements here on the left. Each of these PEs is equipped with two processor cores, one of them with vector extensions, plus a number of fixed function units. Both processor cores are based on the RISC-V ISA, but they're customized: customized to perform the compute and control operations we need. If you've seen architectures like this before, you know a lot of the gains come from each PE being able to talk to its neighbors directly instead of fetching data from memory every time. But eventually you have to go to memory, right? So there's a memory subsystem on the chip that uses LPDDR5 and can scale up to 128 gigs of memory. There's also some SRAM on the chip for high bandwidth, low latency access to data the PEs touch very frequently. And then finally there's this dedicated control subsystem. It's the tiny little thing in the corner, but this is what runs the system firmware. That firmware manages the compute and memory resources, talks to the host driver over PCIe, orchestrates all the job scheduling and all of that, and it was written using the Zephyr RTOS.

The video processor architecture is a little different, a little bit more of a mess. Don't tell them I said that either. The majority of it is the video transcoding core IP that does the hard work. There's also a memory subsystem using LPDDR5, because we like to reuse things, though with a bit less memory. And finally there's a multiprocessor CPU subsystem that boots the accelerator and runs firmware that does similar things: it schedules jobs, following whatever the host is asking it to do. That firmware, again, builds on the Zephyr RTOS, because again, we like to reuse things.

All right, the software stack. I said this was co-designed with the software stack. The software stack is complex, I guess. An understatement. At the very bottom of it is the firmware that we work on. At the very top are things like PyTorch, the recommendation models, and the applications. There have been many, many engineer years put into PyTorch and into these AI applications, and so for us to come in and immediately have impact, immediately improve on what Meta has, we have to fit into that. We can't go and rewrite all the applications. That's where the middle layer comes in, the entirety of the middle layer. It takes the output of existing applications and tries to find the kernels that we can optimize and run on the accelerator. It uses low-level code generation, a low-level compiler and toolchain with accelerator-specific extensions, to take these AI kernels and finally produce the binaries that can run on the aforementioned PEs. Those little binaries then get scheduled onto the PEs through the host driver and then finally through the firmware. Hopefully that makes sense so far.
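To make that last hop a bit more concrete, here's a minimal sketch of what firmware-side job dispatch could look like on Zephyr. This is purely illustrative: the structure and helpers (struct pe_job, pe_grid_submit, host_notify_done) are invented for this example and are not the real MTIA firmware interfaces, which aren't public. The Zephyr pieces themselves, a message queue plus a dedicated scheduler thread, are standard kernel primitives.

```c
/*
 * Hypothetical sketch of a Zephyr-based control firmware accepting job
 * descriptors from the host and handing them to processing elements.
 * All accelerator-specific names here are made up for illustration.
 */
#include <zephyr/kernel.h>
#include <zephyr/sys/printk.h>

struct pe_job {
	uint32_t kernel_addr;   /* DRAM address of the compiled kernel binary */
	uint32_t arg_addr;      /* address of the argument block */
	uint16_t pe_mask;       /* which PEs in the 8x8 grid should run it */
};

/* Queue of jobs posted by the host-facing interrupt handler. */
K_MSGQ_DEFINE(job_q, sizeof(struct pe_job), 16, 4);

/* Stand-in stubs for real hardware access; invented for this sketch. */
static void pe_grid_submit(const struct pe_job *job)
{
	printk("submit kernel to PE mask 0x%x\n", job->pe_mask);
}

static void host_notify_done(const struct pe_job *job)
{
	printk("job on PE mask 0x%x done\n", job->pe_mask);
}

static void scheduler_thread(void *a, void *b, void *c)
{
	struct pe_job job;

	for (;;) {
		/* Block until the host driver posts the next job descriptor. */
		k_msgq_get(&job_q, &job, K_FOREVER);
		pe_grid_submit(&job);      /* kick the selected PEs */
		host_notify_done(&job);    /* tell the host driver over PCIe */
	}
}

K_THREAD_DEFINE(scheduler, 2048, scheduler_thread, NULL, NULL, NULL, 5, 0, 0);
```

In the real system a host-facing interrupt handler would feed that queue from descriptors arriving over PCIe; the point here is only that the control firmware is ordinary Zephyr application code.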
Now, performance of this thing. I didn't run these benchmarks, so don't blame me for the numbers. These benchmarks were run for the ISCA paper, which was actually published last week; I think ISCA was last week. So if you want to know more, you can go read the whole thing. But generally, we feel like we do okay. We do okay on lower and medium complexity recommendation models, both in model size and in gigaflops per batch. I see some of you straining your eyes, I'm sorry; you can either look it up in the paper or download the slides later from the app. Like I said, generally we do well. We don't do as well on high complexity recommendation models compared to GPUs, but we think we have some headroom there. We blame it on the software people: we just need better software to run on this. And that kind of makes sense for higher complexity workloads; it just takes time for the software to reach maturity.

Okay, now back to firmware specifically, and fewer marketing slides. What are our challenges? What do we do? We write device firmware for both of these accelerators, and both are based on the Zephyr RTOS. We use the LTS branch, the long-term stable branch. Yeah, Chris, go Chris. Chris is on my team; he's the maintainer for the LTS branch. We rely on upstream for bug fixes and for security fixes. We do some work ourselves, but we mostly rely on you guys. The firmware runs on RISC-V multi-core processors; I think basically all of our processors are RISC-V. And they're actually capable things. I know it looked like a little thing in the corner, but it's a fairly capable multi-core RISC-V processor, not what you would normally find in an IoT device. So that's something we've had to work on, improving RISC-V support even upstream.

All right, most of the firmware development is actually done pre-silicon, before hardware shows up. This is kind of a unique challenge, because it also means you have to build and maintain the emulation capabilities that mimic the target hardware, and that's also something my team works on. This is because as soon as hardware shows up, we want to be ready to go to production as soon as possible. So we need firmware that works, for some definition of works, before we have hardware. This is also where we leverage Zephyr's ability to target emulation such as QEMU and native POSIX. It helps quite a bit to be able to test your code without real hardware.

All right, this is my "testing is hard" slide. That's basically what it says here. The complexity of the software stack is also organizational: there are so many teams involved in working on that stack that teams don't even necessarily fully appreciate what the others are doing. But we have to come together and test the end-to-end stack to make sure that when hardware comes back, we have something that works, and works well. And this is hard. Most of you probably believe me, because we have to build a fairly complex testing infrastructure that runs the application on a virtualized host, talking to firmware that runs on emulated hardware. This whole thing has to work, and work reliably, and it's something we invest quite a bit in. For low-level firmware tests, we rely on Zephyr's test framework, Ztest, which lets us write simple, fast-to-fail tests. All right.
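As a taste of what those low-level tests look like, here is a minimal Ztest sketch. The helper being tested, job_desc_pack, is a hypothetical firmware function made up for this example; the test macros (ZTEST, ZTEST_SUITE, zassert_equal) are standard Zephyr Ztest, and a test like this runs without any silicon under QEMU or the native POSIX target.

```c
/*
 * Minimal Ztest sketch. job_desc_pack() is an invented firmware helper used
 * only to illustrate a small, fast-to-fail unit test that can run pre-silicon
 * on an emulated target.
 */
#include <zephyr/ztest.h>

/* Hypothetical helper: pack a PE index and a priority into one descriptor word. */
static inline uint32_t job_desc_pack(uint8_t pe, uint8_t prio)
{
	return ((uint32_t)pe << 8) | prio;
}

ZTEST(mtia_fw, test_job_desc_pack)
{
	uint32_t desc = job_desc_pack(63, 3);

	zassert_equal(desc >> 8, 63, "PE index field is wrong");
	zassert_equal(desc & 0xFF, 3, "priority field is wrong");
}

ZTEST_SUITE(mtia_fw, NULL, NULL, NULL, NULL, NULL);
```

Tests in this style are typically driven by Zephyr's Twister runner, which makes it straightforward to run the same suite across emulated targets in CI.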
I mentioned integrating with Meta's infrastructure as one of the challenges, one of the hard things to do. One of the places we integrated that I can actually talk about is our build system. So Buck, or Buck2, is Meta's build system. It's written all in Rust; yeah, Rust, there we go. It's written all in Rust and is designed to be super great. It's open source, which is why I can talk about it, and it's used by thousands of developers every day. Integrating with Buck gives us a lot of leverage, because Buck in turn integrates with many other tools and systems at Meta: things that give us better developer experience and developer efficiency, but also things that give us better visibility or test scheduling. We get all of that by integrating with Buck. One other neat thing about Buck is that it's built to integrate with remote execution, so you're able to write firmware, or any software, locally and then build and execute it remotely. In our case that's really convenient, because you can write firmware locally and then execute it remotely on a machine that happens to have a particular accelerator in it.

All right, and I mentioned Rust. Rust is now a fully supported language at Meta. It's actually a recommended choice for CLI tools, and the number of tools and services written in Rust is growing rapidly; there's probably no stopping it, no matter what people think. We've actually written some firmware in Rust, believe it or not, and we see this as something that's coming, so we're going to have to figure out how to support writing Rust firmware in the future.

All right, some conclusions. Just to summarize: AI workloads are kind of everywhere at Meta, and custom accelerators are our way of extracting efficiency out of the resources we have, in the year of efficiency, if you've heard that term before. The software stack is a fairly complex beast, but at the bottom of it is our firmware, which is based on the Zephyr RTOS. A lot of our challenges are about meeting the performance requirements of the firmware, but also about integrating into the infrastructure and the culture. I don't want to understate the culture part; I've been at Meta for five years, and it's actually kind of interesting to see some of those challenges. And then finally, in terms of our interests, specifically around RISC-V and now Rust, this is where we feel we can have a lot of impact upstream in the near future.

All right, that's it. My timer says 18, so I have some time left. I don't know if that means we're just done or done. All right, I still have time left because I didn't plan for 18.