Hi, everybody. My name is Amar Padmanabhan. I think I know most people here. I'm an engineer on the team, and I've been on the Magma program since it started. There are a couple of things we want to anchor ourselves on before we go through the rest of the day. The first, to the earlier question about why a lighter EPC isn't the option, is to lay out the principles we started from and why we arrived at a distributed EPC architecture, so you can see there is a real rationale behind what we're doing.

The main starting point for us, even four years ago, was that connecting the next billion people to a faster internet is really a problem of heterogeneity. Someone brought up 5G and the new work happening there; there's also LTE, Wi-Fi, and other kinds of access. There's heterogeneity in backhaul: our friends over at Brick are deploying microwave, Mural is deploying microwave, plenty of folks are deploying fiber, and we're having exploratory conversations around satellite as well. There's heterogeneity in scale: an emerging market in APAC is very dense, so the population centers need high-capacity radios, which is a very different problem from the rural Native American communities that Mural, for example, is targeting. And then there are the business models. Connecting the emerging markets is not a pure technology problem; a lot has to change on the business side to bring disenfranchised or lower-ARPU people onto the market, and Yarno will talk about the two-sided market later in the day.

Given that heterogeneity, we had to lay down a few principles for how to solve all of it with a single solution, and that's what we ended up with in Magma. We draw a lot from our data center world. Facebook, as you know, runs the fourth largest network in the world, and whether you look at a data center in Singapore or one in the US, handling very different traffic, Hadoop workloads or web frontend workloads, most of the data center looks the same. Despite the requirements being so heterogeneous, we built on a few key abstractions to keep the heterogeneity manageable, because the death of any software project is heterogeneity: as long as it keeps leaking through your network, you will be building one-off solutions forever. That's the cornerstone of what we started with, almost four years ago.

There are four principles I'm going to cover, and since we have about 20 minutes, I'm happy to keep this interactive: as we finish each topic, I'll take questions on it. The first one is the edge versus fabric decomposition.
This is something we see across a lot of the web-based companies, including Facebook; Amazon, Google, and Microsoft all share a similar philosophy. Back in the day, even seven years ago, the Facebook data center was a hierarchical network with devices performing in-network processing. What that means is that a packet traverses a certain path in your network, and you optimally locate, say, your firewall box, your IDS box, or your load balancer and force all your traffic through it. These choke-point devices force a topology: regardless of how many Hadoop nodes you add, their traffic has to go through the same load balancer or the same firewall. The second problem was that with the data demand we started seeing in our data centers, we had to keep asking the vendors of these boxes to make the pipes faster and faster, but those same middle boxes were also enforcing policy. It became a real challenge for them to support petabytes of throughput and the policies they needed to manage at the same time.

The way we circumvented that was to divide the network into two parts. The fabric is purely focused on moving packets as fast as possible; this is a picture of the Facebook data center from about three or four years ago, and we're now at 16 planes, so the network is denser. All the fabric does is move packets really fast. Then we have a policy-rich edge, which enforces the load balancing, firewalling, ACLs, traffic shaping, and all the other policies that used to be implemented in dedicated middle boxes.

A couple of notes on the edge services; some of this may be obvious, but I want to anchor on it. As soon as you move workloads to the edge, they become fundamentally distributed, and once they're distributed, the aggregate throughput problem becomes much simpler to tackle because there are multiple points in the network enforcing policy. As long as you can do that in a distributed way, x86 is a really good platform to build on: x86 can do complex things and rich lookups, but it cannot do them at very high line rates, so distribution is the key to making x86 successful. The second note is software-only policy enforcement. This is a core tenet of SDN, and it's why so much of the Kool-Aid has been generated around it: software-only policy enforcement enables rapid innovation. We're moving a lot of our data center infrastructure to software-only and reaping the benefits. The third is that because software lets you do richer things, you can use programming-language-style semantics like OpenFlow and eBPF; Facebook is a big pioneer of eBPF, and Magma uses OpenFlow. These let you interact with the network at a much higher level, which simplifies how you program it. If you've worked with, say, the Broadcom SDK, you know how complicated it is to work against an ASIC; every time the die changes, things change. The abstraction that x86 plus these programming languages gives you makes it much easier.
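To make that concrete, here is a minimal sketch (my illustration, not Magma's actual data-path pipeline) of what expressing edge policy in software can look like: a few OpenFlow rules pushed to Open vSwitch from Python. The bridge name, addresses, and queue number are hypothetical.

```python
# Illustrative sketch: an ACL plus a rate-limit queue expressed as OpenFlow
# rules on an Open vSwitch bridge at the policy-rich edge. Bridge name,
# addresses, and queue number are hypothetical placeholders.
import subprocess

BRIDGE = "edge-br0"  # hypothetical OVS bridge at the edge

def add_flow(rule: str) -> None:
    """Install a single OpenFlow rule on the edge bridge."""
    subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, rule], check=True)

# Drop traffic to a blocked destination (a firewall-style ACL in software).
add_flow("priority=200,ip,nw_dst=203.0.113.10,actions=drop")

# Steer a subscriber range into a QoS queue for shaping; the queue itself
# would be configured separately on the port.
add_flow("priority=100,ip,nw_src=10.22.0.0/16,actions=set_queue:1,normal")

# Everything else is simply forwarded: the fabric only has to move packets.
add_flow("priority=0,actions=normal")
```

The point is not these particular rules, but that the policy lives in software you can change in minutes instead of in a vendor ASIC.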
The last note is that distributing services is not a panacea. As soon as you have many more nodes enforcing your policy, operationalizing the solution becomes a key problem in itself. You'll see a talk from Jackie, Josh, and Scott later today about our focus on the NMS and on operationalizing all of this, and it will remain a key area of investment for the Magma team going forward.

So how does this translate to today's GSM and LTE architecture? There are lots of nodes in the network: the SGSN and GGSN for the 3G protocols, the S-Gateway and P-Gateway for LTE. These are stateful, choke-point devices that force traffic along a particular path. Taking the data center approach further, the first thing we're doing with Magma is modularizing the network. What do we mean by that? We distribute the policy enforcement point to the edge, and the edge, as Meghana mentioned, is a sliding scale: depending on how far out your local breakout is, the edge can be in-building or at the local PoP. The second thing is that we move all policy enforcement into software, which gives us the rapid-innovation benefits above. The third is that the core itself is just about punting packets: the simplification is that all you need to solve for in the core of your network is moving IP packets faster. And because we're used to building and running a large network at scale, we can spend our time on operationalizing the network. Sorry, I just need a second for some water. Any questions? Cool, I'll go on.

The second principle of Magma is encapsulation of state. What do we mean by that? In a traditional data center, state is maintained throughout the network: each middle box maintains some state associated with your end client, and that state has to be synchronized across services every time a workload is added to or removed from the network. On top of that, each middle box solved scale-out and high availability independently, so in cases of partition, which we've seen in our networks, two middle boxes can make opposing decisions: the load balancer fails one way, the firewall fails another way, and you start black-holing traffic. The last issue, which is more specific to the Facebook and internet world, is that a lot of our workloads are dynamic, and all of this made it really hard to provision workloads that come and go dynamically. Modern networks solve this the way we do in the Facebook data centers, and the way you see in AWS, through encapsulation: instead of spreading the state associated with a workload throughout the network, you move that state into the same container, the same vehicle, as the workload itself. That gives you the bonus benefit of fate sharing: if the workload goes away, you lose the network state associated with it, and that's fine, because there is no longer any workload that needs it.
The third, implicit benefit is scale-out. As and when you provision more workloads, you have more state in the network; as and when you remove workloads, you have less. The dimensioning problem becomes proportional to the workloads you're actually servicing.

Encapsulation in traditional LTE networks suffers from the same problems we had back in the day with hierarchical data center networks. The UE state exists in all the nodes: the MME, the S-Gateway, and the P-Gateway all maintain state associated with the UE. The other thing, which is fairly unique to the wireless world, is that the air interface specifics leak throughout the network: whether you're connecting over a 5G, 4G, 3G, or Wi-Fi network is known by a lot of nodes. There is no clean abstraction insulating the rest of the network from the generation of access you're connecting with. Imagine connecting your laptop or PC to the internet and having the whole network behave differently depending on how you attached. That clean abstraction, isolating the network from the specific access generation, is the one we never built across the wireless generations.

Here is an example. For the UE, we maintain NAS state and identifiers in the MME, we maintain and lifecycle-manage bearer state in the S-Gateway, and we maintain IP address allocation and policy enforcement in the P-Gateway. Each of these devices can fail independently and has to keep track of the state in the others, which adds a level of complexity that is pretty challenging and makes the signaling overhead of managing multiple devices pretty large.

So the second takeaway is that to get to a simpler network, like the modern web-scale networks and the Facebook data center, we had to find a way to encapsulate the state associated with the UE. We use two principles here. Config state is maintained in a central location, in the orchestrator, which folks will talk about in a bit. Runtime state is entirely encapsulated at the edge, in the access gateway; a rough sketch of what that can look like follows below.

The other thing we're doing is abstracting away the radio-specific technology. As you can see from the Wi-Fi and LTE manifestations of Magma, as well as the 5G work we're starting to think about, the rest of the network is simply not aware of which generation of air interface we're looking at. You'll see something similar with the new 802.11ax Wi-Fi standards and with the 5G work happening across Europe and EMEA, where no single access technology is being standardized on: it's more and more important to reduce the number of nodes affected by the specific generation of access technology we use to talk to our devices. At the end of the day, you want a model where any spectrum is good spectrum, whether it's licensed, unlicensed, 5G, or 4G.
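As a rough illustration of encapsulating runtime state at the access gateway, here is a sketch that keeps everything about one UE in a single record in a local Redis instance (the transcript mentions Redis running on the box later on). The key layout and field names are hypothetical, not Magma's actual schema.

```python
# Illustrative sketch: all runtime UE state lives in one record at the edge,
# backed by the local Redis instance on the access gateway. It survives a
# user-space restart and disappears naturally when the session is torn down.
import json
import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

def store_session(imsi: str, session: dict) -> None:
    """Encapsulate everything about one UE in a single keyed record."""
    r.set(f"ue_session:{imsi}", json.dumps(session))

def delete_session(imsi: str) -> None:
    """Fate sharing: when the session goes, its network state goes with it."""
    r.delete(f"ue_session:{imsi}")

store_session("001010000000001", {
    "ip": "10.22.5.17",                              # allocated at the edge
    "bearers": [{"ebi": 5, "qci": 9, "teid": 0x1A2B}],
    "apn": "internet",
    "radio": "lte",    # the rest of the network never needs to see this
})
```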
We don't want to change the way we build our networks depending on which spectrum we use to get onto them. The challenge, the downside of this, is that as soon as you encapsulate state and move it to the edge, mobility becomes a bigger problem. There's a lot of work we're doing right now, and will keep investing in going forward, on solving mobility in a fully distributed way. There's good precedent for it, because we've solved the same problem on the data center side: tasks at Facebook move around all the time, and we're able to move the network state associated with a task along with the task. We're borrowing a lot of those techniques, and the infrastructure we built on the data center side, to realize this in Magma as well. I'll take a pause here; any questions?

Yes, so because the access gateway is distributed, it currently supports up to three eNodeBs. More often than not it's limited by the number of subscribers that can actually camp on an eNodeB. We've tested up to 10,000 subscribers on an access gateway, but we don't have a use case with 10,000 subscribers on a single access gateway, because we try to co-locate the access gateway with the eNodeBs.

I'll repeat the question: is the performance good? With 20 MHz you get about 150 Mbps on the air interface side, so we can easily handle three eNodeBs. Without DPDK or any acceleration we get about four gigabits per second through the access gateway, and with DPDK it's basically bound by the NIC throughput. The core model here is that each access gateway services a few eNodeBs, so the total traffic an eNodeB brings onto the network is limited by its throughput on the air interface side. Thank you.

Okay, cool. The third principle is state in the control plane. This is pretty standard SDN. We follow a desired-state model, which is an evolution of what people did with OpenFlow in the past; a small sketch of the idea follows below. State is centralized behind an API: the user inputs intent through that API, and the control plane is responsible for enforcing it. For example, if you look at models like the OCS and the PCRF, the OCS in particular is not a desired-state model: the BSS tells the OCS what quota a user has, and the OCS is responsible for dishing out specific amounts of it. In our model we have OCS-like semantics, so you can enforce quotas, but the control plane is responsible for figuring out how to push that quota; all the user interacts with is the total quota allocation. The second point is that the control logic is completely decoupled from the data path. You have programmable APIs exposed through the orchestrator, and they get enforced on x86 using some data-path technology; at this point that's OpenFlow, and we're looking to enhance it with eBPF as well. The idea is that the control flows can evolve independently of the data path, and vice versa.
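Here is a minimal sketch of the desired-state idea, including the OCS-like quota semantics: the operator only states the intent, and the control plane computes whatever actions are needed to converge the data path toward it. The class and function names are hypothetical stand-ins, not the orchestrator's real API.

```python
# Illustrative sketch of a desired-state model: intent in, reconciliation out.
from dataclasses import dataclass, field

@dataclass
class DesiredState:
    quota_bytes: int                        # total quota, OCS-like semantics
    rules: set = field(default_factory=set)

@dataclass
class ObservedState:
    bytes_used: int = 0
    rules: set = field(default_factory=set)

def reconcile(desired: DesiredState, observed: ObservedState) -> list:
    """Compute the actions needed to move the data path toward the intent."""
    actions = []
    for rule in desired.rules - observed.rules:
        actions.append(("install_rule", rule))
    for rule in observed.rules - desired.rules:
        actions.append(("remove_rule", rule))
    # The control plane, not the operator, decides how quota is enforced.
    if observed.bytes_used >= desired.quota_bytes:
        actions.append(("redirect_to_portal", None))
    return actions

desired = DesiredState(quota_bytes=5_000_000_000, rules={"allow-internet"})
observed = ObservedState(bytes_used=120_000_000, rules=set())
print(reconcile(desired, observed))  # -> [('install_rule', 'allow-internet')]
```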
As and when x86 makes more powerful primitives available to us, we're in a position to leverage them, and any application interacting with the control plane remains stable through that. The last point here is that we use modern distributed-systems techniques to propagate state: we terminate Diameter and S1 (over SCTP) at the edges, and we use HTTP/2 and key-value stores to manage internal state and move state around the network. Any questions here?

So the question is: other than OpenFlow, what are we planning to use on the data path side? We're not moving entirely away from OpenFlow. There's a Linux technology called eBPF, built around LLVM, that Facebook contributes to heavily, and it makes it easier for us to build more stateful applications at the edge. The idea is to use eBPF to enhance what we have with OVS when we need something slightly more stateful at the edge, but to enhance rather than replace it.

The last principle, and the one closest to my heart, is software release. Today, if you commit code into the Facebook code repository, it's available for people to use within the same day; we follow what's called a continuous-push model. Telecom networks are a long way from that. So how do we get to the level of agility we need in telecom networks to support the principles we're used to? The cornerstone is fault domains. In a traditional architecture, again, the devices are choke points that maintain a large amount of state, and every time you do an upgrade you're absorbing an amount of risk that is not easy for operators to accept. That makes these devices too big to fail. What we're doing instead, by distributing the access gateway and carrier Wi-Fi gateway functionality to the edge, is spreading the complexity of those choke-point devices across many more devices, with state distributed across many points at the edge. Now upgrading any one access gateway takes out only a very small percentage of your network, and if you tier your network correctly, with a staging tier, a pre-production tier, and then a production tier, rolling software upgrades out across the edge becomes much simpler; a minimal sketch of that tiering follows below.

So the last takeaway is about software upgrades. Start by designing for fault domains, and always think in terms of small upgrade domains. That's the only way to mitigate risk, because no one writes perfect code; let's flush things out through the rollout process rather than relying on developers to write flawless code. Each node in the Magma architecture, especially the access gateway, is independently upgradeable, and there is a notion of tiers available through the orchestrator that lets us upgrade gradually, staging before production. The next point is that the control plane is completely independent from data-plane operations. This has happened to us in the past as well: if the control plane goes down for whatever reason, existing users on the network are completely unaffected. We're still not going to be able to authenticate new users coming onto the network until it's back, but that's a much smaller outage than, say, what O2 had in Europe a few months ago.
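A minimal sketch of that tiering, assuming hypothetical tier and gateway names: a release is promoted tier by tier, so any single upgrade only ever touches a small fault domain, and a failure in one tier halts the rollout before it reaches production.

```python
# Illustrative sketch of tiered software rollout across small fault domains.
# Tier names, gateway IDs, and the upgrade call are hypothetical placeholders
# for what the orchestrator would actually drive.
TIERS = ["staging", "pre-production", "production"]

FLEET = {
    "staging":        ["agw-lab-1"],
    "pre-production": ["agw-site-a-1", "agw-site-a-2"],
    "production":     ["agw-site-b-1", "agw-site-b-2", "agw-site-c-1"],
}

def upgrade_gateway(gateway_id: str, version: str) -> bool:
    """Placeholder for pushing a new image to one gateway and health-checking it."""
    print(f"upgrading {gateway_id} to {version}")
    return True  # assume health checks passed

def rollout(version: str) -> None:
    for tier in TIERS:
        for gw in FLEET[tier]:
            if not upgrade_gateway(gw, version):
                print(f"halting rollout: {gw} failed in tier {tier}")
                return  # the blast radius stays inside one small tier

rollout("1.4.0")
```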
Any questions about this? Yes, that's a good question. The question is what the redundancy story is for the controller. The controller is a cloud-native application built on Kubernetes, so it's truly scaled out. Depending on your dimensioning, we recommend provisioning three nodes of each type, but you can scale it up across regions or within the same region; it's a cloud-native application.

The next question is what the notion of a node is. At this point we're purely a software entity. There are deployment form factors where you deploy it on a 1U server or a small four-CPU box. I think Boris is here; he's going to talk a little about the Mirantis integration we did, which is being deployed on Virtlet, a VM form factor that looks like Kubernetes.

The next question is how we keep the S-Gateway and P-Gateway up during an upgrade without affecting traffic. The way it works right now is that we push the cached flow state to the kernel, and when we upgrade the user-space process, we make sure we don't touch the existing kernel flows. The idea is, first, that the upgrade window is fast, and second, that if some signaling happened during it, say a UE tried to come onto the network or a bearer got removed, it simply gets retried. For existing users who are already connected, the kernel cache just keeps working for the duration of the user-space restart.

That's right, the question is whether existing TEIDs are backed up. Yes; I think Shruti is going to talk a little about the stateless MME, and she'll go into it then. We run a key-value store, Redis, on the box itself, which gives us durability across restarts, and there's an in-memory version of the flow state in the kernel that keeps the TEIDs stable for the duration of the upgrade. We also try to minimize how long the actual software upgrade takes: you upgrade the bits first and only then restart the process, as opposed to restarting the process and then downloading the bits; that ordering is sketched below.

The question is whether we're using DPDK. For user space, no. At this point the only place we've considered DPDK is for accelerating OVS itself, and we haven't found a real need for it yet. Potentially, if the number of eNodeBs we have to support per access gateway grows to the point where we need DPDK, we'll use it. The general problem with DPDK, as most of you know, is that the debuggability toolchain around it is very poor: happy paths work great, but as soon as things start going bad, it takes a lot more time to debug. So it's one of those things we keep in our back pocket but haven't had to use.

On throughput: because we're fully distributed, you only need to look at aggregate throughput, so you keep stamping out nodes and you can get the throughput you need. As we see some denser deployments, I think some of what Yarno is going to talk about will drive those use cases, but not right away. That's right.
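As a rough sketch of the upgrade ordering described above, with hypothetical package and service names: stage the new bits first, leave the kernel's cached flows alone, and only then restart the user-space services, so the restart window stays short and already-connected users keep forwarding.

```python
# Illustrative sketch of the upgrade ordering: download first, restart last.
# The package name and systemd unit names are assumptions for illustration.
import subprocess

PACKAGE = "magma-agw"                          # hypothetical package name
SERVICES = ["magma@mme", "magma@pipelined"]    # assumed user-space unit names

def stage_new_bits(version: str) -> None:
    """Download the new bits ahead of time; nothing user-facing changes yet."""
    subprocess.run(
        ["apt-get", "install", "--download-only", "-y", f"{PACKAGE}={version}"],
        check=True,
    )

def apply_upgrade(version: str) -> None:
    """Install from the local cache, then restart only the user-space services.
    Existing kernel datapath flows keep forwarding; any signaling interrupted
    during the short restart window is simply retried."""
    subprocess.run(["apt-get", "install", "-y", f"{PACKAGE}={version}"], check=True)
    subprocess.run(["systemctl", "restart", *SERVICES], check=True)

stage_new_bits("1.4.0")
apply_upgrade("1.4.0")
```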
And I think that's fundamental to a lot of what the web-based networks do right now: you don't want big devices that are so sacred you're afraid to upgrade them, because agility in the network is a very big part of being able to innovate in it.

I have about another minute, so let me close this out. The first principle is flexibility: we focus on modularizing the network into a fast fabric and a policy-rich edge. The second is scalability: we scale up or down by stamping out more nodes; by encapsulating the UE state in a gateway node, we can add or remove gateway nodes as needed. The third is any spectrum: we isolate the protocol-specific pieces at the edge, in the access gateway, behind a protocol normalizer, so regardless of whether the access is 4G or Wi-Fi, the rest of the network looks the same, and we're looking to expand that to 5G on the roadmap. The next is programmability: we work on a desired-state model with the control plane; you program the control plane, and the control plane is responsible for delivering the forwarding state to all your nodes. And the last is agility: there's no single large fault domain, so you can upgrade nodes within your network easily. Cool, thanks, that's all I had. I think next up we have a coffee break, so feel free to walk around and meet the people next to you. Thank you.