My name is John Casey, and this is Sirad Mohan; we're from Seaplane Networks. Today we're going to talk about building multi-site distributed OpenStack clouds. Discussions about hyper-distributed or multi-site clouds started at previous OpenStack Summits in Austin and Tokyo. We're going to extend that discussion from a customer perspective: what's driving this in terms of applications and the marketplace, a real-life example of a production customer that has built, deployed, and is already running a hyper-distributed cloud, and then lessons learned and how to bring what we've learned back into the OpenStack community.

AWS is a great example of a hyperscale cloud today, one of the most successful clouds out there. They can do things that, frankly, enterprises can't, which is why there's so much movement into the cloud. The hyperscale model is about building out large pools of compute and storage resources and selling them to enterprises. The problem it doesn't solve is latency: anyone who has used AWS knows that if you have one instance in Asia and another in the US, latency is a killer. The inter-region latencies across AWS are well known, and that latency problem is exactly what hyper-distributed clouds set out to solve.

Service providers caught on to this a few years ago and said: we have networks everywhere in the world, we're partners with AWS, and we're in places AWS is not. So we can solve the distribution problem of clouds, provide low-latency clouds, and partner with AWS to extend their reach. This is transforming service providers in fundamental ways, and we'll go through that in this presentation.

As for the drivers: the applications running on hyperscale clouds today are heavily storage oriented — storage, backup, big data — things you can do when you have lots of compute power in one place. As you move along the path toward distributed applications, think about SAP: an ERP system managing a supply chain across different parts of Asia, the US, and Europe. You want to locate that application as close to the business process as possible and tie it all together, and service providers are looking at that problem. And in the edge cloud — mobile edge computing, streaming media, IoT, games, self-driving cars — you need very low latency to support those applications. So service providers are looking at placing compute resources as close to the edge as possible. If they can support applications everywhere they have PE routers or edge routers, they can offer 10 milliseconds of latency or less, and if you put compute out at the cell towers, you can get down to around one millisecond. The applications you can support are no longer just big data; they're real-time applications. That's transforming the thinking about cloud in pretty fundamental ways.
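To put those latency tiers in perspective: light in optical fiber propagates at roughly two-thirds the speed of light, about 200 km per millisecond, so propagation delay alone sets a floor on round-trip time. A minimal sketch of that arithmetic (the distances are illustrative, not from the talk):

```python
# Back-of-the-envelope propagation-delay estimate (illustrative distances).
FIBER_KM_PER_MS = 200.0  # light in fiber travels ~2/3 c, about 200 km/ms

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone
    (ignores queuing, serialization, and processing delays)."""
    return 2 * distance_km / FIBER_KM_PER_MS

for label, km in [("cell tower, ~20 km away", 20),
                  ("metro PE router, ~500 km", 500),
                  ("cross-continent, ~4000 km", 4000),
                  ("intercontinental, ~15000 km", 15000)]:
    print(f"{label:28s} RTT floor: {min_rtt_ms(km):6.1f} ms")
```

This is why a workload at the cell tower can respond in about a millisecond, a metro PE router can deliver around ten, and a globe-spanning path can never beat a few hundred.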
Just for reference: it takes about seven milliseconds for a pain signal to travel from the tip of your finger to the base of your spine. So if you imagine compute resources at every cell tower around the world, those resources can respond in under one millisecond. Imagine what the world would be like with that sort of reactive capability across the globe. The five key trends driving this are mobile edge computing, network function virtualization, SDN, IoT, and 5G. The rollout of 5G promises mobility everywhere at low latency, and combining that with edge cloud provides compute resources at the edge.

So why is a distributed, multi-site, hyper-distributed OpenStack cloud different from what we typically think of as hyperscale? You're dealing with a genuinely different set of problems. It's not about tweaking one pod of OpenStack and getting the most out of it; it's about getting all the pods to work simultaneously as one global cloud. Hand-tuning individual instances is off the table — you need cookie-cutter deployments. And most of the problems actually exist outside OpenStack, in what I'd describe as people, places, and things.

People: when we moved from siloed IT — separate network, storage, and compute teams — to converged cloud, we combined those skill sets, and now you have that in spades. Beyond combining skills within one OpenStack instance, you're now doing it across geographic regions, so the person managing the network in New York is not the person managing network or storage in, say, Japan. Places: latency is a killer, and complexity is a killer; the huge latency between one OpenStack instance and another is a fundamental problem you have to overcome. And things: you're not building one OpenStack site to solve one specific problem. Because you're building globally, you want to amortize the infrastructure across many, many different problems and applications, and that changes what you integrate with from a northbound perspective, how you manage it, and the applications you support.

That amortization is big and long-lived, and service providers have a pretty complex business model: retail customers, wholesale customers, different sales channels, peering points. A globally distributed cloud deployment has to fit into that model. The application topologies driving hyper-distributed cloud are complex too — think about deploying parts of SAP across the globe and connecting them together, or SD-WAN, where you may have one VNF in AT&T and another in Verizon or Deutsche Telekom. And lastly there's the N-squared problem: you have a number of OpenStack clouds, geographically distributed — think thousands — and a number of systems you're integrating with.
The problem is that instead of one management system, one BSS system, or one portal, all of those systems now have to connect to all of the clouds. With M management systems and N clouds, direct integration means M × N connections; a single orchestration point reduces that to M + N. It's a split-brain problem, compounded by human complexity: different people and organizations managing and monitoring the cloud, and different applications deployed across different management systems. You really have to solve for this complexity — you have to reduce it — and we'll talk about how we did that.

Now for the use case. It's a customer — a service provider — that has been in production for over a year. They've gone through the initial release, upgrades, multiple releases now, and going into their second year they're starting to deploy this globally across their network. Their business is essentially that of an MPLS provider: their customers are geographically spread across the world, and their traditional business is connectivity services. They may have, say, Coke or Pepsi with all of their bottling plants connected together by MPLS services. They looked at this and said: we can place compute resources everywhere we have a point of presence and offer them to those customers as part of their MPLS service — data and processing close to the bottling plant, for example.

The challenge they presented to us about a year and a half ago was this: we have a global network, and we want to deploy OpenStack from tens of sites up to thousands of sites — anywhere we have access to customers. Because our customers are geographically distributed, they need to be able to order services through a portal and deploy anywhere in the world, and it has to be seamless. Aside from the day-zero and day-one work of standing up OpenStack, everything must be automated: total flow-through provisioning, completely hands off. The customer says "I want compute resources," clicks a couple of buttons on a screen, and seconds or minutes later compute resources show up connected to their MPLS VPN, looking like just another site on their network — even though it's actually sitting inside the service provider's network.

On the technical side, there's latency: the average latency across the world is roughly 200 to 300 milliseconds, and we measured an average of about 260 milliseconds, which is a big challenge in itself. They also didn't want special versions of OpenStack. They wanted to stay on the core release train with minimal customization, so they could keep taking advantage of community releases — innovate on hardware and perhaps offer additional services, but not carry a forked OpenStack. Functionally, because they're making money from this, they wanted a unified view of the customer: a customer inventory system, a global picture of how many VMs and networks each customer has, and IPAM across all the sites. They also wanted to offer best-placement services.
So when a customer says, "I have video on demand that I want to deploy, and these are my locations," we find the best placement — maybe based on latency, maybe geolocation, maybe data-governance rules about which country the workload can run in — and we place it. They also wanted to support different types of applications depending on the business relationship: NFV is certainly the killer app for most service providers, but beyond that, compute resources for their enterprise customers, IoT for their wholesale customers, and the ability for wholesale customers to build on the IaaS infrastructure themselves. And from a metadata perspective, they wanted complete visibility — just as they have in their MPLS network, they want to understand what's going on in their IaaS. If they need to take down or upgrade an OpenStack instance, they want to know which customers are associated with that outage or migration, and be able to migrate them off first.

Then there's the business challenge, which is always the hard part. We had to build an initial prototype in about ten weeks, and that prototype would become the basis for the production system. Normally a prototype means taking a bunch of servers off to the side and seeing if the thing works. They said: no, we want to run this on our production network — on production PE routers carrying live traffic. They wanted the real latency in the test; we saw that average of 260 milliseconds across the globe. They also wanted integration with their BSS system, which is a challenge in itself: we were tying into five different vendors and organizations — someone building or integrating a service portal for their customers, the existing BSS system, the new BSS system, the billing system, the whole eTOM service chain. So a thinly sliced function, but a true production system, initially in two locations. After we finished that, we were off to the races. It started as two locations; over time they've deployed it across the world, with a back end to AWS and Azure so their customers get global access from their MPLS network into Azure and AWS. Customers can order resources on demand, tear them down, change them, move them around — and it's been a pretty successful business for them.

In terms of how we approached this — I'll get to the solution in a minute — I first want to talk about how we evaluated the options against all of those challenges on the way to a prototype and then production. We first looked at what we could do with bare OpenStack. Can we federate it? Can we replicate the data? Can we modify it in some way? That turned out to be a futile exercise. The latency and scale of this problem are too much, and OpenStack becomes fragile; even federating Keystone at this scale and this latency is not a good option. There's no clean way to separate the domains of management control when you federate OpenStack out — if one instance is geographically distributed, the person managing it in Tokyo and the person managing it elsewhere end up with a broken management relationship. And rolling upgrades are hard.
When you're talking about thousands of instances, you can't migrate them all from Liberty to Mitaka in one night, so you need a story for rolling upgrades. There's also the issue of cross-site responsibility. And the biggest problem we ran into from this angle: how do you tie OpenStack into a BSS system? We found that the OpenStack APIs are far too complex for a BSS to swallow and integrate directly — UUIDs and all the rest — for even one OpenStack, let alone thousands. BSS systems deal with the world at a different level of abstraction.

Next we looked at Tricircle, which we actually think is a pretty interesting project trying to solve this same problem. When we got into it, though, the project seemed a bit too fragile, a bit too early, for what we were trying to do. We also saw some documented performance issues that probably couldn't be solved easily, and we didn't want to present that risk to the customer. It also didn't solve the API problem of integrating with the BSS system, and it didn't meet the customer's timeframe — we could have built on it, but we couldn't see a path that fit the schedule. So for a number of reasons we put it on hold and did something else.

A word about the network. This was discussed at, I think, the previous OpenStack Summit: we all talk about whether we can distribute at layer 2. It turns out most applications work fine with distributed layer 3, and there are only a few special use cases that genuinely need distributed layer 2. At this complexity and latency, layer 2 becomes a problem — not just to manage, but because stretching layer 2 over 300 milliseconds probably doesn't work for the applications anyway. We may come back to those special cases over time, but we decided not to try to make layer 2 work.

So what did we do? To reduce complexity, we kept it simple: a very simple deployment model in which every OpenStack instance is completely isolated. They don't know about each other — true shared-nothing. That gives the service provider the best possible options for management, migration paths, and innovation within OpenStack itself. It fits neatly into how the organization is managed: they have different regions around the world — maybe in Sydney they've outsourced the management of their cloud, in Tokyo they have their own people, in Africa a partner service provider helps them. It also enables what they're thinking about next: specialized hardware for specialized applications. They can build up different OpenStack instances and experiment with hardware — different types of disks, hardware acceleration for video streaming, for blockchain, for IoT — and mix and match clouds, offering different capabilities and innovating in a distributed way.
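A minimal sketch of how such shared-nothing sites could be tracked and selected from outside OpenStack — the schema, field names, and scoring policy here are hypothetical illustrations, not the production system's:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    """One isolated, shared-nothing OpenStack instance (hypothetical schema)."""
    name: str
    region: str                   # operational/management region, e.g. "APAC"
    lat: float
    lon: float
    capabilities: set = field(default_factory=set)  # e.g. {"ssd", "video-accel"}
    cost_per_vm_hour: float = 0.0
    free_vcpus: int = 0

# The orchestrator's global view; the sites themselves never see each other.
REGISTRY: list[Site] = []

def find_best_site(required_caps: set, min_vcpus: int,
                   near: tuple[float, float]) -> Site | None:
    """Pick the closest site that satisfies capability and capacity
    constraints, breaking ties on cost. The policy is illustrative only."""
    def distance(s: Site) -> float:
        # Crude planar distance; fine for ranking candidate sites.
        return ((s.lat - near[0]) ** 2 + (s.lon - near[1]) ** 2) ** 0.5

    candidates = [s for s in REGISTRY
                  if required_caps <= s.capabilities and s.free_vcpus >= min_vcpus]
    return min(candidates, key=lambda s: (distance(s), s.cost_per_vm_hour),
               default=None)
```

The design point is that none of this lives inside any OpenStack instance; the registry sits above them, which is what keeps the sites mutually unaware.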
From a networking perspective, it was actually pretty easy. They're already an MPLS provider, so this fits nicely into their business model. Inside the data center we use VXLAN; across the WAN we use the service provider's MPLS; and we tie the two together. A global customer has an MPLS VRF, we tie their VXLANs into that VRF, and to the customer those VXLAN segments look like part of their MPLS VPN. This is also compatible with a fairly unique service-provider arrangement: what are called NNI links, the interconnects they run with partner service providers. So if a customer had a cloud inside AT&T integrated into an MPLS VPN alongside a cloud in this provider's environment, they could share the same globally distributed cloud. Layer 3 is well understood, and there are really no complex edge cases, which is what makes this interesting.

That left the question of how to tie all these OpenStacks together. We decided to approach it as an orchestration problem, sitting on top of OpenStack, and that turned out to meet both the customer's timeframe and all of their goals. It let us create metadata outside OpenStack that joins all the OpenStacks together: metadata about location, about the capabilities of a given pod of OpenStack, about its hardware, about the latency between a cell tower and a site, or the cost associated with a particular site. It also let us extend the customer relationship outside of OpenStack and Keystone. Service providers have a very complex view of a customer — not just the end customer, but MVNOs, service providers who sell to service providers who resell to customers. The orchestration layer holds that complex understanding of the customer and ties it together across all the OpenStacks: a unified view of the customer outside OpenStack, proxying down to the various Keystone identities in each instance.

The next thing to solve was tying OpenStack into their BSS systems. What we ended up developing is a set of really powerful APIs that drive everything in OpenStack from a very simple — or rather, very high-level — interface that can be easily integrated with a service portal, a BSS orchestration system, or some other orchestration system; we'll get more into that shortly. It also lets us integrate non-OpenStack objects — routers and other equipment that connect into MPLS — and coordinate all of it through the same single API. And it provides essentially unlimited scalability: we're not trying to scale OpenStack itself, and orchestration systems are used to working with tens of thousands of routers and hundreds of thousands of ports, so we can really scale from that perspective.
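Going back to the VXLAN-to-VRF stitching described above, here is a minimal sketch of the kind of binding the gateway provisioning has to express. The identifiers and the binding format are hypothetical, since actual gateway configuration is vendor-specific:

```python
from dataclasses import dataclass

@dataclass
class CustomerVpn:
    """Global MPLS L3VPN identity for one customer (values illustrative)."""
    customer: str
    route_distinguisher: str   # e.g. "65000:1001"
    route_target: str          # import/export RT shared by all customer sites

@dataclass
class SiteAttachment:
    """One site's tenant VXLAN segment, to be bound into the customer VRF."""
    site: str
    vni: int                   # VXLAN network identifier inside the data center
    gateway: str               # DC gateway that peers with the provider's PEs

def attach_site(vpn: CustomerVpn, att: SiteAttachment) -> dict:
    """Build the (hypothetical) gateway binding: terminate the VXLAN segment
    and leak its routes into the customer's MPLS VRF, so the cloud tenant
    looks like just another site on the customer's VPN."""
    return {
        "gateway": att.gateway,
        "vrf": {"rd": vpn.route_distinguisher, "rt": vpn.route_target},
        "vxlan_vni": att.vni,
        "advertise_to_pe": True,   # BGP session from the gateway to the PE
    }
```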
So what does this look like? This picture is representative of their environment — just two sites, a simple view. At the top, different business service systems talk to a multi-site orchestrator, and that orchestrator communicates with OpenStack — Nova, Neutron — through the low-level APIs to carry out what was requested at the higher level. There's an SDN layer doing VXLAN into a gateway, which peers with their PE routers and their VRFs. We can then orchestrate VRF sites on demand, tie them into the tenant in OpenStack, and even provision ports on AWS and Azure to connect that MPLS VPN into a site in AWS or Azure as well.

By solving this at the orchestration layer with a set of standard APIs, we also solved another problem that had dogged them — one they didn't really know they had until we got into this — which is security. The security boundary sits at the orchestration layer: no system above it has root or admin access to OpenStack; it's all managed below the line. That matters for a couple of reasons. As you think about flow-through provisioning — a service portal saying "provision a VM across my cloud" — you may have different kinds of orchestration systems on top. MANO, the NFV orchestrator, which is really important now, becomes a client of this orchestration system; maybe there's an IoT variant as well; and the BSS system comes down through the same path. So a common set of APIs matters. But the more important reason is that they don't want to dictate to other service providers which orchestration system to use. This lets us create a common standard for multi-site APIs, so that AT&T, for example, can reach in and provision services in this provider's environment — just as they do with network services today — without being given root access. That preserves the relationship with other service providers on the wholesale side and opens up whole new opportunities to sell wholesale IaaS.

So what does this require of an orchestrator? I won't go into all the details, but essentially: you need a set of common, agreed-upon APIs you can hand to other service providers and say, this is how you order services from our cloud. You can expose low-level APIs, but with no root access; the real value is in the powerful APIs that say "connect this customer to this site." And you go work with the standards bodies — we're working with the MEF to propose a subset of our APIs — so other service providers can start thinking about this problem. You also need a workflow engine that supports aggregates of objects with atomicity: when you provision a customer in a site, you want all of the steps to execute as one atomic operation, which takes a workflow engine to coordinate, as sketched below. It has to be high-throughput with a lot of fan-out — some of the work being done in Tricircle relates to this, and we'll work with them. And you need a database that extends the multi-site metadata beyond what currently exists in OpenStack: location, customer inventory, customer relationships, the capabilities of different sites, the cost of sites.
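That atomicity requirement amounts to a compensating workflow: run each provisioning step, and if any step fails, undo the completed ones in reverse order. A minimal sketch, with hypothetical step names:

```python
def run_workflow(steps):
    """Run (do, undo) pairs as one logical atomic operation: if any step
    fails, compensate the completed steps in reverse order and re-raise."""
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        for undo in reversed(done):
            undo()   # best-effort rollback; a real engine would log and retry
        raise

# Hypothetical "instantiate customer in site" workflow:
# run_workflow([
#     (create_keystone_project, delete_keystone_project),
#     (create_tenant_networks,  delete_tenant_networks),
#     (bind_vrf_attachment,     unbind_vrf_attachment),
# ])
```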
And then you need services like IPAM. The idea is that some Neutron subnets in the sites are unmanaged — they're purely local to each site — while other, layer-3 subnets are IPAM-managed across sites, so you need both kinds of service. Floating IPs — or what they call GIA, for example — need to be managed on a per-site basis, pushed down into OpenStack from a global system that understands the routing, the BGP ASN routing, of those addresses. And obviously there's security you need to build in.

So, quickly, what the APIs look like, and then a quick example. This is a customer-oriented API language: it's all about managing the customer. I look at the OpenStack APIs as a set of Lego blocks — you can build anything you want out of lots of small blocks — whereas this is very special-purpose: I have a customer, I want to deploy that customer, find a location for that customer, and build on top of that customer. So there are customer objects — sites, machines, networks, storage — all rooted in the customer. To find sites across the globe you need location-based services, capabilities, flavors: the specifics of each site. On top of that you can present a higher-level, location-oriented API that says, "find me the best location for this application."

A quick run-through, with a sketch after this passage. First, create a customer; that creates the customer globally, without touching any OpenStack. Then find the best location for that customer's request — say there are 100 sites around the globe, and one is selected. Then instantiate the customer in that site: completely seamlessly to any user or operator, we create the Keystone user, the Nova resources, everything needed to get that customer up and running in the site, all combined into one set of workflows. Once the customer is in, you can layer on additional services: managed subnets, unmanaged subnets, an ACL granting access across the MPLS link, and whatever VMs are needed. You can support Heat templates pushed down through the API, or orchestration on top — MANO orchestration, for example — and then traditional VM lifecycle management.
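Here is that run-through expressed as client calls against such a customer-oriented API. The endpoint, paths, and field names are hypothetical stand-ins, not the published API:

```python
import requests  # plain REST client; the API surface below is hypothetical

BASE = "https://orchestrator.example.net/api/v1"   # placeholder endpoint

# 1. Create the customer globally -- nothing touches any OpenStack yet.
cust = requests.post(f"{BASE}/customers", json={"name": "acme-bottling"}).json()

# 2. Ask the orchestrator for the best site for this request.
site = requests.post(f"{BASE}/placements", json={
    "customer": cust["id"],
    "near": {"lat": 35.68, "lon": 139.69},     # e.g. close to a Tokyo plant
    "capabilities": ["ssd"],
}).json()

# 3. Instantiate the customer in that site: one workflow creates the Keystone
#    project, base networking, and the MPLS VRF attachment behind the scenes.
requests.post(f"{BASE}/customers/{cust['id']}/sites", json={"site": site["name"]})

# 4. Day-two services: a globally IPAM-managed subnet, an ACL toward the
#    customer's MPLS VPN, then ordinary VM lifecycle calls.
requests.post(f"{BASE}/customers/{cust['id']}/subnets",
              json={"site": site["name"], "managed": True, "prefix_len": 26})
requests.post(f"{BASE}/customers/{cust['id']}/acls",
              json={"site": site["name"], "allow_from": "customer-vpn"})
requests.post(f"{BASE}/customers/{cust['id']}/machines",
              json={"site": site["name"], "flavor": "m1.small", "count": 2})
```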
I'm going to wrap up here and take questions. What we really want to do is take what we've learned — the APIs, the orchestration approach — and bring it back into the community. The deployment of multi-site, hyper-distributed OpenStack is something we're really on the cusp of, and we want the community to get ahead of it: to start thinking about how to approach it, and to provide the standards and the leadership for it. So we want to find the appropriate project — that might be Tricircle; they have a lot of interesting work already, including some orchestration capability. We want to work with contributors: we're going to publish our APIs, our designs, and some sample code into a project. And we're going to work with the standards bodies to fine-tune these APIs and present them from an orchestration perspective. If you want to contact us or get involved, you can reach me at this address or Sirad at his, and I'll open it up for questions at this point.

So the question was: if each OpenStack is independent, each with its own Keystone, how are we presenting a unified view of the customer? That's done outside of OpenStack, in the orchestration layer. The orchestration layer holds the knowledge of what the customer looks like from the service provider's view — the complex relationships — and a foreign key to the tenant, the user, the objects inside each of the various OpenStacks. Everything goes through the orchestrator to get to OpenStack; we're not giving out Horizon access — imagine trying to give a customer 100,000 views of Horizon; it just wouldn't work. Everything goes through a service portal and a service chain instead. We're building a compatibility layer that translates to the OpenStack APIs, and we're also presenting a higher-level, unified set. So there are two levels: a low-level set of APIs, mostly compatible with OpenStack, and the higher-level unified APIs. Retail customers go through a portal and a chain of management systems that give them the full capabilities of OpenStack, just in a different way; wholesale customers can use their own management systems — their own MANO, for example — tied in through the orchestration layer, which provides both the high-level and low-level APIs.

Any other questions? That's correct. The question was how the APIs work. Southbound, from the orchestrator down to the different OpenStack instances — which may be running Mitaka, Liberty, various versions — we use the standard OpenStack APIs. Northbound, above the orchestrator, we present a set of low-level APIs that's very similar, almost compatible, to OpenStack's, but also a set of APIs that are easily consumed by management systems. So it's a dual set. We're also working on things like TOSCA — being able to send a document, for example, and have the orchestrator do everything necessary in that TOSCA document.

Yeah, sure. Every OpenStack instance is managed by a different group, a different person, with a different release cycle and different hardware, and it's the orchestrator's job to understand and absorb the differences between versions of OpenStack. If you talk to anybody who does orchestration at the network OS level, this is commonplace: we're used to dealing with different versions of Cisco routers, different OS versions, different capabilities. It's a very similar problem.

On the element management system, the EMS: from a VNF perspective, a MANO system sits on top of the multi-site layer, talks to our APIs, and says, "I want a VNF — let's say SD-WAN — in Sydney and in New York," sitting over the top. That's one of the five-or-so MANO implementations out there. As for the EMS for, say, a router sitting on top: our orchestrator can do that, so it's possible, but we're not doing it today.
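To make the foreign-key arrangement from the first answer concrete, here is a minimal sketch of the mapping the orchestration layer might keep — the schema is entirely hypothetical:

```python
# The orchestrator's customer record is the source of truth; each entry in
# "keystone_refs" is a foreign key into one site's local Keystone.
customer = {
    "id": "cust-0042",
    "name": "acme-bottling",
    "parent": None,                  # wholesale/MVNO chains would hang off this
    "keystone_refs": {
        "tokyo-01":  {"project_id": "9f3c0aa1", "user_id": "a1b2c3d4"},
        "sydney-02": {"project_id": "77de19bc", "user_id": "e5f60718"},
    },
}

def project_for(customer: dict, site: str) -> str:
    """Resolve the local Keystone project before calling that site's APIs."""
    return customer["keystone_refs"][site]["project_id"]
```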
Next question over here. Our orchestration tool is itself hyper-distributed: it's fully clustered and can sit in a single location or in multiple locations — it's a deployment choice. You can stop by our booth and we can walk you through the details. Okay, one last question. Sorry, I didn't hear your question — could you speak up? You're asking about monitoring — whether there's a common monitoring platform across the multiple sites, at the same point as the orchestration? Yes: they have separate network monitoring and compute monitoring, plus log management. We pull in the telemetry data we need, and they also have a global monitoring system that does collection and distributes events to the different locations. So they already have a pretty sophisticated set of management tools that we tie into. Does that answer your question?

More or less — and a small follow-up: from the orchestration perspective, if you collect data from the infrastructure, can the orchestration system act on policies based on that collected information? Right. The idea is that "find the best place to deploy a resource" takes a number of things into account: location, latency to the customer making the request, capacity, hardware. Say they have a pod of hardware accelerated for blockchain or IoT — we have metadata that describes that — plus the capacity and utilization of each OpenStack: how much compute it has, how much headroom.

The questioner's point, which I think is a fair one: it's technically possible to collect information and propagate rules down into the infrastructure, but it's complex to map your business rules into technical rules you install in that infrastructure. Completely agreed. I'm only showing a very small portion of their system here, the portion focused on OpenStack. No single orchestration system can solve everything; that's why service providers typically run a whole stack of applications that integrate together, and it's one of the reasons for a domain orchestrator that deals just with distributed OpenStack: it can bubble up the data a higher-level system needs to make the more aggregate rules. There's no way you could put Keystone UUIDs and all of that into a business system — it's too complex; it would fall over. Thank you, and thank you all for attending. If you want to get involved, stop by our booth at C5.