Good morning everyone, I'm Vanessa Little, I'm the director of NFV ecosystem architecture over at VMware, but try not to hold that against me. I'm going to talk to you today about some of the gaps in using OpenStack for multi-access edge computing, or MEC, and its distributed architectures. I'm going to go through some of the design flaws in OpenStack that need to be remediated to really make it effective for MEC, and I'm going to talk a little bit about how you can work around those things today so that you can at least achieve smaller deployments.

So here's the agenda of what we're going to talk about. I'm going to define what MEC is in case anyone is a little foggy on that. We're going to look at the core architecture challenges, we're going to look at some of the tools or bolt-ons that also have challenges, and then we're going to talk about what we can do about it.

So there are two widely used terms that seem to be interchangeable in the industry. There's mobile edge computing, which specifically means pushing workloads to the edge for telco and mobility workloads, so cellular workloads, and then there's multi-access edge computing, which is what I'm really more focused on in this lecture today, and that covers any kind of infrastructure that's pushed to the edge, not necessarily for telco workloads, but any workloads generically.

So when you look at a mobile edge computing or multi-access edge computing architecture, what we're really looking at is a centralized control plane where your orchestration and your core VIM components live, and then you've got these edge data centers, and an edge data center could be as small as one node. It could be a cell phone, it could be a connected car, or it could be a micro data center. These architectures take on different iterations, and because what you define as the edge could be so many different things, it could even be a camera, the architectures can shift and pivot a bit. But the fundamentals are the same: you have centralized orchestration, VIM management, and SDN management, and then you have your workloads, your actual VNFs or your workload VMs or containers, pushed down to the edge, closer to the users that are consuming those services.

When you look at the use cases people are targeting MEC for, particularly in the telco space, these topologies get really massive, and when you start to look at things like 5,000 edge clusters that are all centrally managed from one control plane, you start to notice some of the cracks in OpenStack as it is built today for managing those types of architectures.

This is another visualization of MEC: you centralize that control plane, that network orchestration, that service orchestration, and then push those services out to an edge data center. You don't necessarily have to have a directly connected network fabric to do that; you can push it over the internet. Those edge data centers can exist in a public cloud somewhere, in a private cloud on-prem with a customer, pretty much anywhere, and they don't have to be fixed in one particular place. The whole concept behind MEC is that those edge data centers, those edge devices, can move, and they can appear in different places. And so being able to automate the control, telemetry, and management of those edge data centers poses some really unique challenges.
Traditional OpenStack wasn't really built for data centers that are moving. It wasn't designed for data centers that are physically separate, or for data centers that rise and fall in the blink of an eye, that may exist for one hour only and then be torn down. And that volatile nature of MEC presents a few unique challenges that we're going to discuss.

When you look at how it's being used for telcos, and the telco use case, it becomes really nonsensical to assume that you can centralize all of your control into one data center. That would be a giant single point of failure. So you look at clustering those core data centers where your control plane lives, and then you have this concept of aggregation data centers. That's where you summarize some of those control plane components so that you don't have to backhaul all of that control data from the centralized data center to the edge. For those of you who are more network-oriented people: if you look at how telephony networks are built, you have a core network where a lot of your core switching and routing happens, and you have aggregation data centers that hold summary routes for the edge data centers. That same topology concept applies to multi-access edge computing when you're looking at workloads and managing those workloads.

So here's a scenario for you to think about. You've got a workload at a micro edge data center, and you need to monitor it. Say, for example, it's a video transcode application, and you've pushed it down to that particular data center because you've determined that that data center is physically closest and has the best ping time to the user that needs that video transcoded. Now you need to pull telemetry data off of not only the VIM layer and the infrastructure layer, but also the app layer, to determine that the service is healthy and to know whether you need to scale it out, restart it, or, if no one's using it anymore, shut it down. But if you do that at large scale and you've got 5,000 instances of these spread across an entire country, pulling all of that telemetry data back to your centralized data center doesn't make any sense. What you want to do instead is pull that telemetry to your aggregation data center, and that's where the decision as to whether or not that service is healthy actually occurs. So instead of pulling all of that unmanaged telemetry to a central data center where the decision has to be made, you pull it a little bit closer so that you're not backhauling all of that traffic, you make the decision there, and then only one instruction goes back to your centralized data center to say: that service isn't healthy, I want to do something about it. In some MEC topologies, that "do something about it" operation actually occurs at the aggregation data center. Being able to spread things out like that, but still have the centralized data center manage the overall topology while the aggregation data centers manage the edge topologies, is a lot more effective when you look at spreading out these workloads.
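To make that concrete, here's a minimal sketch of that aggregation-tier pattern in Python. Everything in it is hypothetical (the metric names, thresholds, and site IDs are made up for illustration); the point is only the shape of the logic: raw samples stay at the aggregation data center, and at most one small instruction crosses the WAN to the central site.

```python
# Minimal sketch of the aggregation-tier idea: edge telemetry is
# evaluated locally and only a verdict, not raw samples, goes upstream.
# All names and thresholds are hypothetical, for illustration only.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    site: str          # which micro edge data center produced this
    cpu_pct: float     # infrastructure-layer metric
    app_errors: int    # app-layer metric, e.g. failed transcodes


def evaluate_site(samples: list[Sample]) -> str | None:
    """Health decision, run at the aggregation data center.

    Returns one instruction for the central site, or None when the
    service is healthy and nothing needs to be backhauled at all.
    """
    if not samples:
        return "scale-in"          # nobody is using it: tear it down
    if sum(s.app_errors for s in samples) > 10:
        return "restart"           # the app layer is failing
    if mean(s.cpu_pct for s in samples) > 85.0:
        return "scale-out"         # the infrastructure layer is saturated
    return None                    # healthy: send nothing upstream


# The raw samples never leave the aggregation tier; only the verdict does.
edge_samples = [Sample("edge-042", 91.0, 0), Sample("edge-042", 88.5, 1)]
verdict = evaluate_site(edge_samples)
if verdict:
    print(f"to central: edge-042 transcode service -> {verdict}")
```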
And when you look at how the availability zone and cluster design in OpenStack is currently built, it's not really granular enough for these types of data centers and topologies. Even though you can define different availability zones, and different clusters within those availability zones, pushing a workload down to a specific node for a specific reason isn't really built into OpenStack today. There are some other tools that try to achieve this, but they're still lacking the telemetry bits that would make that an effective decision.

One of the more obvious problems with OpenStack for MEC topologies is scalability. Current industry limits in production deployments max out at around 1,000 nodes. We know the theoretical limits are higher, but in practice people are deploying no more than 1,000 nodes in one OpenStack deployment. That obviously doesn't bode well for MEC topologies where we're looking at 5,000 micro edge data centers, not to mention how many data centers you'd have in the aggregation layer and the centralized layer. It's simply not enough nodes to build those types of topologies. Storage over WAN is far too sensitive to latency and packet loss, so managing your storage the way you typically do in an OpenStack cloud isn't really feasible over these architectures. You have to break it up and define separate storage pods in each location, which makes administering them fairly onerous. And when you're looking at over 500 hosts and 15,000 VMs in one region, but you also have multiple regions within the same topology, you're really bumping up against the limits of what OpenStack is capable of today.

So let's talk a little bit about why that is. One big issue is image storage. If you need to deploy a workload down to an edge data center, you don't really want to backhaul that image file all the way from your centralized data center down to your edge. Some of those images can be quite large, and if you do it often enough, if you actually have very volatile workloads popping up at the edge, which is one of the core premises of MEC, then pushing an image down every time you need to do a deployment becomes really ridiculous. It'll eat up all of your bandwidth and it makes the MEC topology financially unviable. To get around that, you can put a Glance image store locally, either at your aggregation site or right down at your MEC edge site, but now what you've done is build a whole lot of unique Glance image stores that you need to keep in sync. So now you have to deploy some tools, or write your own, to keep those in sync; there's a rough sketch of that below. That's assuming, of course, that you want them in sync, because you might want to offer different versions of different images at different locations.

And when you start looking at regulations, particularly in North America, around where you're physically allowed to serve certain content, you're getting into a really complex mesh of image management. For example, what if you had a CDN that shows local football games? In the United States there are blackout rules that say you can't distribute that content within a certain radius of the stadium. Being able to make that decision programmatically and manage all of those different image files becomes pretty complex and pretty difficult.
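On the sync problem specifically, nothing ships in the box today, so you end up scripting it yourself. Here's a rough sketch of what that could look like with openstacksdk, assuming hypothetical clouds.yaml entries named "central" and "edge-042"; a real version would add checksum verification, retries, and bandwidth scheduling so the sync itself doesn't eat the WAN, and it does nothing for the content-policy problem just described.

```python
# Rough one-way Glance sync from a central site to one edge site using
# openstacksdk. The cloud names "central" and "edge-042" are hypothetical
# clouds.yaml entries; images are staged through a local temp file.
import os
import tempfile

import openstack


def sync_images(src_cloud: str, dst_cloud: str) -> None:
    src = openstack.connect(cloud=src_cloud)
    dst = openstack.connect(cloud=dst_cloud)
    present = {img.name for img in dst.image.images()}
    for img in src.image.images():
        if img.name in present:
            continue                        # already at the edge: no backhaul
        path = os.path.join(tempfile.gettempdir(), f"{img.id}.img")
        src.image.download_image(img, output=path)   # one pull from central
        dst.image.create_image(
            name=img.name,
            filename=path,
            disk_format=img.disk_format,
            container_format=img.container_format,
        )
        os.remove(path)
        print(f"copied {img.name} to {dst_cloud}")


sync_images("central", "edge-042")
```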
OpenStack doesn't currently have any out-of-the-box tools for that kind of policy-aware image distribution.

In order to achieve these multi-access edge computing architectures, SDN overlay networking is a must. A lot of people have built proofs of concept using ODL to influence the physical devices between the centralized data center and the edge data center, and that works really well at small scale. But when you want to build a full layer-two mesh across 5,000 different data centers, doing that with ODL alone is not feasible. I don't even think it's possible, but I challenge the room: if anyone's actually pulled it off, I'd like to hear more about it. So being able to snap in that SDN overlay networking becomes essential to achieve these architectures, and then being able to orchestrate that SDN overlay becomes essential too, because it's not feasible to manually manage a network of that size by doing all of that configuration by hand. You need to push that either to an orchestrator or to some sort of automation like Terraform, or to one of the NFV MANO solutions, to pull it off. It's not trivial.

And the biggest, most fundamental flaw in OpenStack that makes MEC architectures infeasible is the bus-based architecture. RabbitMQ has some serious limitations around latency, and if you start to push those components further and further apart, strange and interesting things will happen to your OpenStack cloud. If the RabbitMQ bus has more than two milliseconds of latency, you're going to see some odd behavior and you're going to start seeing failures in your cloud. But when you're pushing physical infrastructure possibly thousands of miles away, even with a really robust network between sites, you're likely to get more than two milliseconds of latency (there's a quick sketch below of how to measure that against your own broker). So having those control components centralized no longer becomes feasible, and what you end up having to do is have little OpenStack deployments all over that need to be centrally managed. Now, instead of one OpenStack deployment that manages 5,000 micro data centers, you have, say, 25 OpenStack deployments that each manage 200 data centers. But now you've got all these OpenStack instances to manage and maintain, to push images across, to roll telemetry up from, and to migrate workloads between, so you need an orchestration layer that can do all of that. Even some of the more sophisticated MANO layers can't really manage infrastructures that large right now. Some of them boast that they can, but in practice no one's actually pulled it off yet. That's the real challenge with OpenStack: the bus-based architecture is so ingrained into the fundamental way OpenStack is built that it's very difficult to work around, and the only workaround is to have more OpenStack instances distributed closer to your edges, which is not necessarily feasible either.

So as I just mentioned, the orchestration models and tools aren't really good enough to pull this off yet. When you start looking at intelligent workload placement, and at gathering all the data you need to make a decision about where to place your workload, current orchestrators don't even have that type of thing in their data model.
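As an aside, the two-millisecond point is easy to sanity-check against your own broker. Below is a minimal probe using the pika client (1.x); the broker hostname is a placeholder, and it measures a publish/basic_get round trip through RabbitMQ rather than a full oslo.messaging RPC, so treat it as a lower bound on what the OpenStack services would actually see.

```python
# Minimal RabbitMQ round-trip latency probe using pika (1.x).
# "rabbit.example.net" is a placeholder for your own broker host.
import time

import pika

conn = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbit.example.net"))
ch = conn.channel()
ch.queue_declare(queue="latency-probe", auto_delete=True)

samples = []
for _ in range(50):
    t0 = time.perf_counter()
    ch.basic_publish(exchange="", routing_key="latency-probe", body=b"ping")
    method = None
    while method is None:              # poll until the message comes back
        method, _props, _body = ch.basic_get("latency-probe", auto_ack=True)
    samples.append((time.perf_counter() - t0) * 1000.0)

conn.close()
samples.sort()
print(f"median broker round trip: {samples[len(samples) // 2]:.2f} ms")
```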
For example, if you wanted to push a workload to the data center that has the lowest ping time to the user, none of the orchestration data models currently have that built in. If you wanted to model that service, the whole concept of ping time doesn't even exist in the data model. And even if it did, there's no way to actually pull that data out of OpenStack right now; there are no tools for it short of adding bolt-ons, writing scripts, or attaching various monitoring tools to get you that data. So intelligent workload placement, which is a core tenet of MEC and a very important factor, isn't even possible with the current infrastructure today. (A rough sketch of what that selection logic could look like follows at the end of this section.)

So what do we do instead? I think we've beaten OpenStack to death; I've said a lot of really nasty things about OpenStack, so what's the solution? There are a number of different paths we can take. One is to start adding the features we'd like to see in OpenStack, which, as I mentioned, is going to be really difficult because of the way the RabbitMQ bus works. Pulling that out of OpenStack and replacing it with something more distributed is akin to changing the engine of a car that's running on the Autobahn. It's not a great idea because of how much change it would force into the code base.

A lot of people say containers are the solution for MEC: just use Kubernetes, push a bunch of Kubernetes clusters down to the edge, find a way to centrally manage them, and everything's going to be fine. In practice, there aren't enough applications that run in Kubernetes to make this feasible, especially in the telco space. And intelligent workload placement in Kubernetes isn't really there yet either; that whole concept of latency and ping times doesn't exist in Kubernetes any more than it does in OpenStack. The only way around that is to decide to spin up a new Kubernetes cluster at the micro data center and tear it down when you no longer need it. But what you're effectively doing then is spinning up a Kubernetes instance every time you need to load an app. That's not the way it was intended or built to work; it would be a nasty workaround and, in my opinion, a waste of infrastructure. The people who are currently doing this are using what I'd call a hybrid model: some Kubernetes at the edge, some VMs at the edge, some IoT gateways at the edge, plus a lot of bolt-on open source tooling that they're manually integrating into a unique snowflake of an architecture. That's how they fulfill all these different needs: intelligent workload placement, managing a cluster that large, pushing workloads out where they need to be, determining service health on day two, and doing something about it when that service health fails.

And the third alternative is: let's just scrap all of it and start over with a brand new open source project. There are some interesting things starting to pop up in the industry right now along those lines.
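Before getting to those, here's the placement sketch promised above: a purely hypothetical illustration of the missing "lowest ping time" logic, with made-up site names and addresses. It times a TCP handshake to each candidate edge site and picks the fastest; a real orchestrator would need to measure from the user's vantage point and carry the result in its data model rather than bolting it on like this.

```python
# Hypothetical edge-site selector: estimate latency to each candidate
# site with one timed TCP handshake and place the workload at the fastest.
# Site names and addresses are made up for illustration.
import socket
import time

CANDIDATE_SITES = {                      # edge site -> reachable endpoint
    "edge-mtl-01": ("203.0.113.10", 443),
    "edge-tor-04": ("203.0.113.74", 443),
    "edge-van-02": ("203.0.113.120", 443),
}


def connect_ms(addr: tuple[str, int], timeout: float = 1.0) -> float:
    """Rough latency estimate: wall-clock time of one TCP handshake."""
    t0 = time.perf_counter()
    try:
        socket.create_connection(addr, timeout=timeout).close()
    except OSError:
        return float("inf")              # unreachable sites lose automatically
    return (time.perf_counter() - t0) * 1000.0


def pick_site() -> str:
    return min(CANDIDATE_SITES, key=lambda s: connect_ms(CANDIDATE_SITES[s]))


print(f"place the transcode workload at: {pick_site()}")
```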
On the start-over front, ETSI has spun up a whole MEC initiative to start defining what the standards should be, what the components of an effective MEC architecture need to be, and what the interfaces between them need to be. Similar to the approach ETSI took with NFV, they're taking it with MEC, and their resolution is: start over. This is not an NFV architecture, so stop trying to use NFV tools, stop trying to use cloud tools to achieve this. MEC is something unique, and it needs to be addressed in a unique way, with its own type of software, its own paradigm, and its own architectures. Starting from scratch in that sense is a little bit scary, because I feel like we might already be a little bit behind.

There are some interesting projects, like Project Akraino, that have popped up in the past year to try to address these issues and start asking: what kind of tools do I really need at the edge? Intel built their Network Edge Virtualization SDK to start making some of these edge apps more feasible, and they've built in tools that make modeling these apps a little easier to do, but it is, in effect, just an SDK. It's not a solution, it's not an app, it's a toolkit. It's like throwing you a bag of tools and a pile of lumber and saying, okay, build me a house. It's not a house. So it pushes the responsibility back onto the open source community to say: if these are the standards coming out of ETSI, and we all kind of agree this is the way it should be, and here's the toolkit I have, well, let's get to it. That's what Project Akraino is attempting to do, and I think so far they've had some reasonable success. There have been some really interesting demos at this show, and also at the one in Vancouver, where they've demonstrated some of the things they're already able to achieve at the edge, but it's by no means finished.

And the real question of the day is: can we wait for something new? With all of the hype around MEC and 5G and IoT, and everyone saying "I need to go to market with these services now," is it feasible to really wait for the ETSI standards? We all know how long those take. Is it feasible to wait for a brand new open source project to spin up out of nowhere and become a viable solution that's production ready and fully supportable, with commercial vendors distributing and supporting their own versions of it, so you'll have someone to call when it breaks and can maintain all of your SLAs? Can we wait for that? I don't think we can.

So what's going to happen? My personal prediction is that people are going to deploy the hybrid architectures with the NFV tools that are available today, and there are going to be some serious functionality gaps in those until these new platforms are ready, and then you're going to see a massive exodus, a massive migration. And will that end architecture have OpenStack in it? Probably not. It'll probably be something totally different, or some spin-off, or maybe some applicable components from OpenStack will be cherry-picked and added to the new architecture. But OpenStack, as it currently exists today, I don't believe will be in the end-state architecture for MEC. But I'm interested to hear what you folks think, so I've left a little time for questions and comments.
So, anyone who has an opinion on what I've said, because I've offered a lot of my opinions in this lecture that you may or may not agree with, please feel free to step up and ask some questions or offer your own opinion.

Just on your last statement, that OpenStack might not be the feasible choice for MEC in the long run: you mentioned one of the solutions might be that OpenStack clusters would be pushed towards the edge, with controllers managing servers that sit very close to those controllers, keeping it below two milliseconds just to avoid the problems related to RabbitMQ. Is that not a feasible choice? Because then, indeed, we have multiple OpenStack clouds, and we've moved the problem a bit away from OpenStack, but it's just a question of how to push images, how to push flavors, networks, et cetera, and I bet there will be ways to do that in a nice way. So why not OpenStack? Just asking the question.

Yeah, I mean, I agree, that is one solution: push a complete OpenStack cloud to every edge. But when you're managing over 5,000 edges, is that really feasible? I would argue no. It's difficult enough to manage maybe 20 unique OpenStack clouds within the same infrastructure. When you have 5,000 that are all doing something a little bit different, and you want to leverage them as failover clouds for each other, that gets pretty onerous, pretty difficult, and it requires some pretty clever architecture, some pretty clever day-to-day maintenance, and a pretty wicked orchestrator to stitch all of that together. Sorry, you had a question?

Nova Cells should support something like this. Are you seeing similar initiatives? Nova Cells, as explained, for example, in the CERN use case: they use it to run something like 73 cells in various locations, not only in Geneva, and get locality for some of the components, although CERN explained that Neutron and some other projects are very different. So I would appreciate it if the community would work towards that kind of thing.

I would appreciate it too, because I think that's definitely a step in the right direction. Being able to disaggregate different Nova cells and manage them is definitely a step down the right path, but it still bumps up against that RabbitMQ issue. Even if you have unique instances of RabbitMQ paired with each of those different Nova cells, it's still very difficult to manage, because effectively what you've built is a mini OpenStack cloud. It's not a separate OpenStack cloud, you've just broken off a few pieces, but because you have to keep that RabbitMQ bus paired and proximal, because of the way it works and the way workloads are pushed onto that bus, it doesn't really solve the problem. It does allow you to scale out a little bigger than you can with just basic OpenStack today, though. Anyone else?

I would potentially welcome something like federation, or some other approach similar to using a single OpenStack cloud: it's federated, but there's still a single API that applies to all of them.
There are a lot of people working on federation one layer up, at the orchestration layer. Underneath, they're connecting to unique OpenStack instances and unique OpenStack clouds, but what they present to the user is a unified experience, and they push a lot of the same config down to those OpenStack clouds to make them appear the same. So effectively they're faking it at the orchestration layer. ONAP's approach is kind of like this, and OSM's approach is also very similar: as of Release FIVE, they have this WIM feature, the WAN Infrastructure Manager, where you have multiple VIMs but the orchestrator is exposed to one pool of compute that happens to span multiple data centers. It's one way to work around it, one way to mitigate it, but there are still some unique challenges in migrating a workload from one OpenStack instance or cluster to a different one. Even if they're configured completely the same, there's a lot of logic and workflow that has to happen to take one of those workloads and copy it to another one. Anybody else?

All right, well, thank you so much for coming early. I really appreciate it; I know it's difficult on the last day of the conference. If you have any questions, you can either catch me in the hall, or my information's on the app with this lecture, so feel free to reach out and chat about any of the stuff we've discussed today. All right, thank you.