Yeah, we can start. All right, let's get started. Welcome, everyone. Today's talk is "OpenStack brings Kubernetes to the edge." My name is Emilien Macchi. I work in engineering at Red Hat. My current focus is running Kubernetes on top of OpenStack for the NFV and edge use cases.

I am Maysa de Macedo Souza. I'm also an engineer at Red Hat, and I work with Emilien on Kubernetes on top of OpenStack. My focus is on the installer and container networking.

And I'm Chris Janiszewski. I'm a principal solutions architect, part of field engineering. I do a lot of prototyping with customers, anything data center related. I consider myself a client advocate. So I take anything that customers throw at me and try to stream it to these two, to help them prioritize and look at how to improve the products.

So for the agenda, we are going to give a brief introduction to edge computing. Then we are going to do a deeper dive into Kubernetes at the edge with OpenStack: go through some of the proposed architectures and how it works under the hood with services like compute, networking, and storage, and also go through the roadmap. Chris will go over some lessons learned from the field, and then we will have questions and answers. I would like to highlight here that there will be some differences between what is actually supported in upstream OpenStack and what is available downstream at Red Hat.

So we have seen that data, and the applications that use data, have increased a lot over the last years, especially with the pandemic. Businesses had to digitize, if they were not digitized yet, and they had to increase their capacity in order to keep up with the demand. And if the data they are using is critical, or requires a fast response, or requires high availability, then edge can definitely help.
Because with edge, the data is processed closer to its actual source, at the edge sites. This helps because there is no need for the traffic to go all the way to the core and back in order to get a response. So edge is everywhere. When we think about the use cases, take automotive, for example: if trucks are getting close to each other, they can communicate and avoid a possible accident. Or finance, where there is no need for the data to go all the way to the core and back, which helps avoid security leaks and so on. And also health care, where there are devices that can be used to monitor patients, so the health care assistant can reach out faster and provide better care. Emilien?

Thank you, Maysa. So of course, today we are here to talk about edge. And I think we are also here to discuss the new distributed architectures we developed for OpenStack. Today we will present two architectures. I will do one, and Chris will do the other one. This one is the geographically distributed edge architecture, something that we also call DCN, which means Distributed Compute Nodes. This is the architecture you will see when the compute sites are remote. We are talking about big or small data centers that will run your workloads at the edge. One of the advantages of this architecture is that we have been working on it for many years. It has been battle-tested. It has evolved based on all the needs that we have seen in the field. And because it's well tested, it's much simpler to deploy. One of the things I would like to highlight in this architecture is that the OpenStack control plane lives within one availability zone, which means one physical area. The downside shows up when connectivity is lost between this area and the other sites.
The control plane would not be available to the remote sites for some time, but the workloads would keep running fine. Speaking of the workloads, we are talking about Kubernetes here. In this architecture, we recommend deploying one Kubernetes cluster per site, or many clusters per site. But you would not want to stretch your Kubernetes clusters across many sites. That's something we will discuss next. I want to talk about latency here, because in this architecture we are talking about an RTT, the round-trip time, of usually around 100 milliseconds in theory, even though, as we discussed earlier, it can go a bit above this. But usually, we talk about those numbers. And one of the things I would like to highlight, and we will go into details later, is that in this type of architecture, because of the geographic distance, you want to push your data to the remote sites. I'm talking about the Glance images, the Cinder volumes, and those kinds of things that we will detail a bit later in this presentation. So now, Chris, over to you.

Thanks, Emilien. So this architecture is really cool. But what if I tell you that you can take the same architecture and, with a few tweaks, push it to the edge of your data center? You may ask me, why would anyone do that? The edge is really for something outside of my main data center. This is something we see customers trying to do all the time, especially if the availability of your workload, or the lack of it, costs the company a lot of money. Think of financial institutions, trading institutions, et cetera. Every second the application is down matters. It sometimes costs them millions of dollars. So with this little twist, the main goal is to maximize the number of nines in your SLA, to make the workload as available as possible.
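To make "the number of nines" concrete, here is a minimal sketch that converts an availability target into the downtime it allows per year. The helper name `allowed_downtime_seconds` is hypothetical and the year is assumed to be 365 days; none of this comes from the talk itself.

```python
# Hypothetical helper (not from the talk): turn "N nines" of availability
# into the downtime budget it leaves per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignore leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for e.g. 99.9% availability (nines=3)."""
    unavailability = 10 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the yearly budget by a factor of ten, which is why workloads where every second of outage is expensive push for designs that survive the loss of a whole fabric.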
So again, the idea behind this architecture is the ability to kill any of the network fabrics in your data center and still have resiliency across the entire deployment. And there are some considerations. You don't want to distribute this architecture geographically, because of the latency. As long as the latency is good, we should be fine, but the latency requirement is quite important. If you are familiar with how Kubernetes and etcd work, I think the requirement is that etcd wants single-digit millisecond latency between the nodes. Maybe double-digit is OK. That's the main consideration when you architect this. From the OpenStack perspective, you also want to have all these services available across multiple availability zones. So in this architecture, you're trying to replicate what public clouds have been doing for a while, where you have multiple availability zones, maybe attached to different power distribution units, and then stretch your Kubernetes across them and have those workloads be resilient from one AZ to the other. For the OpenStack control plane, today you also have to stretch it over Layer 2. That's one of the limitations that is getting addressed. We're going to talk about that in the roadmap. If you haven't seen Luis's presentation yesterday about the BGP feature, I highly recommend going back and watching the recording. This is something we are doing to address this issue.

So now we're going to dive deep into what's under the hood. As Maysa mentioned, some of the limitations and considerations are tied to the TripleO project, or to the IPI installer for OpenShift, which is the Kubernetes distribution that we work on at Red Hat. Not all of these limitations or considerations will be the same for the upstream projects. I just want to make that distinction here. So let's start with compute. We're going to go to network and storage later.
So from the compute perspective, one of the features that a lot of customers are asking us for is live migration. Live migration is available in this architecture. One caveat is that you can live migrate workloads only within the same availability zone. If you need to move workloads from one AZ to another, it's going to be a cold migration. There is some disruption, an outage of the workload itself, if you want to do that. A lot of companies out there implement this architecture to take advantage of hardware accelerators of all sorts: DPDK, SR-IOV, maybe GPU or vGPU. All of these features are available for you to use. We also see a lot of customers deploying Kubernetes not just on virtual machines. To take full advantage of the bare metal assets they have, they would put, let's say, Kubernetes workers straight on bare metal with Ironic, and something like the masters and infra nodes on VMs. That's definitely available in this architecture. Again, from the installer perspective, and by installer I mean OpenStack TripleO as well as IPI for Kubernetes or OpenShift, the full lifecycle automation is there. You can do a push-button deployment. You can even do zero-touch provisioning and lifecycle as well, with some caveats. And then we see a lot of customers who cannot extend DHCP services to the edge. That's also supported and possible. There is an option to use pre-provisioned nodes. It has some considerations from the automation perspective, but in general, this is something that we've seen in the field as well. I'm going to pass it to Maysa.

Thank you, Chris. So can I see a show of hands if there is anyone here who still uses the traditional three-layer architecture? No. Everyone is on spine-leaf already? Wow, you guys are good. All right.
Well, in any case, if anyone was shy, I will just go through why spine-leaf can be considered better than the traditional three-layer architecture and how it overcomes some of that architecture's limitations. The traditional three-layer design is based on north-south traffic. With clouds and container workloads, the most common type of traffic is east-west, and spine-leaf can be a better fit for that. If we analyze how a packet travels in the three-layer design, from a server in location one it has to go through two aggregation switches, here and here, and one core switch in order to reach another server in location two. This increases the latency and also creates traffic bottlenecks. With the spine-leaf architecture, traffic only has to cross one spine to reach another server. So there is a predictable latency. And it's simple to expand. You can add more spines if you want more throughput, or more leaves if there are more users accessing it. At the same time, the failure domain can be limited to a leaf. And since all the leaves are connected to all the spines, the fabric is non-blocking.

If we bring the benefits of spine-leaf to workloads, we can think about routed provider networks from Neutron. What routed provider networks bring is that the user has one layer 3 network, which holds multiple L2 segments. Each of those segments has one Neutron subnet assigned to it. So each leaf has its own CIDR. And when booting an instance, either a VM or bare metal, the Nova scheduler attaches that instance to the right segment. The user can also specify a Nova AZ when creating that instance, or use a pre-existing port. So we will focus right now on two of the network types that can usually be used with the spine-leaf architecture.
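The routed provider network model described above (one L3 network, one L2 segment per leaf, one subnet per segment) can be sketched as a small data model. This is only an illustration under assumed names: `Segment`, `RoutedNetwork`, the host names, and the CIDRs are all hypothetical, and the real placement is done by Neutron and the Nova scheduler, not by code like this.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One L2 segment of the routed network, i.e. one leaf (hypothetical model)."""
    name: str
    cidr: str                                   # the single subnet bound to this segment
    hosts: set = field(default_factory=set)     # compute hosts attached to this leaf


@dataclass
class RoutedNetwork:
    """One L3 network made of several L2 segments."""
    name: str
    segments: list

    def segment_for_host(self, host: str) -> Segment:
        # An instance landing on a host must use the segment that host can reach.
        for segment in self.segments:
            if host in segment.hosts:
                return segment
        raise LookupError(f"host {host!r} is not attached to network {self.name!r}")

    def subnet_for_host(self, host: str) -> ipaddress.IPv4Network:
        return ipaddress.ip_network(self.segment_for_host(host).cidr)


# Illustrative values only.
network = RoutedNetwork(
    name="edge-provider",
    segments=[
        Segment("leaf1", "10.0.1.0/24", {"compute-leaf1-0", "compute-leaf1-1"}),
        Segment("leaf2", "10.0.2.0/24", {"compute-leaf2-0"}),
    ],
)

if __name__ == "__main__":
    print(network.subnet_for_host("compute-leaf2-0"))
```

The user only ever refers to `edge-provider`; which subnet the port ends up in follows from where the instance is scheduled.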
So with provider networks, you can have one large L2 network. But this brings a few complications to take into account. For example, if many endpoints are ARPing for addresses, you end up with one large failure domain. You could instead have multiple smaller L2 networks, which would reduce this issue. But then it becomes complicated for the user to know where exactly each provider network is available. And, of course, a provider network is highly performant, and it's provided to the user by the infrastructure admin. When we think about routed provider networks, they can be a better fit in the sense that they are highly scalable. There is one segment per site, in other words, per leaf. And there is no confusion on the user side about where exactly that provider network is available. He or she just gets one network, and the rest is transparent to them. And of course, there are smaller failure domains. Again, it's a network that is managed by the infrastructure admin. And now I'll hand it over to Emilien.

So let's talk about storage a little bit. We have a few slides about storage. Let's start with Glance. Earlier, when we talked about the geographically distributed architecture, we talked about moving your data close to the workloads. And one piece of data that you have, of course, is the Glance image for your OpenStack machines. One of the key points here is that you have to think about the type of workloads that you run and how often the images have to be updated. In this slide, I present two ways of doing it. Of course, there are more. But to summarize, I think you either want to cache the image on the remote site or pre-import the image before deploying your workload. Caching is very useful when you have a Glance image that has a short life.
If you have a workload that has a security constraint and has to be updated every day or something like this, you would probably consider deploying your workload once on the site, and the image would be cached. Or, and there is some work ongoing that we will see later, you would pre-cache the image. But in the case of Kubernetes at the edge, what we usually do is import the image using Glance. I don't know if you can see it, but there is a CLI that allows you to import the image from an HTTP server or something else and push it directly to the remote edge sites where you want to deploy your workloads. That's useful once you get to deploying your Kubernetes workers, because the image will already be available on the remote sites. So that's what we highly recommend for Kubernetes running on the edge sites. Also, just to mention that raw images can be pretty large. And at the edge, you don't want to transfer a lot of data, for many reasons. There is an option you can use to sparsify the image during the transfer, and of course, that's way more efficient. That option is available in OpenStack.

Let's talk about volumes in Kubernetes. This is a 10,000-foot slide about PVCs in Kubernetes. What we want to present here is how Cinder and Manila can be really useful for Kubernetes clusters at the edge. I'm not going to present the details of how it works in Kubernetes. What I want to share with you is that having Cinder and Manila deployed on the edge sites, managing the local storage, is something that we highly recommend. With your data close to the workloads, in the Kubernetes CSI drivers, both for Cinder and Manila, you have an option to select the availability zone of the volume that your workload needs to use.
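As a rough illustration of selecting a volume availability zone through the Cinder CSI driver, the sketch below builds a Kubernetes StorageClass manifest as a Python dictionary and prints it as JSON. The function name, class name, and zone name are hypothetical. The provisioner name, the `availability` parameter, and the topology key reflect my understanding of the Cinder CSI plugin and should be checked against the documentation for the version you run.

```python
import json


def storage_class_for_zone(name: str, zone: str) -> dict:
    """Build a StorageClass that pins Cinder volumes to one availability zone.

    Assumptions to verify against your Cinder CSI version:
    - provisioner is "cinder.csi.openstack.org"
    - the "availability" parameter selects the Cinder AZ
    - the topology key is "topology.cinder.csi.openstack.org/zone"
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "cinder.csi.openstack.org",
        "parameters": {"availability": zone},
        # Delay binding so the volume is created where the pod is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowedTopologies": [
            {
                "matchLabelExpressions": [
                    {
                        "key": "topology.cinder.csi.openstack.org/zone",
                        "values": [zone],
                    }
                ]
            }
        ],
    }


if __name__ == "__main__":
    # "az-edge-1" is a made-up zone name for one edge site.
    print(json.dumps(storage_class_for_zone("cinder-edge-1", "az-edge-1"), indent=2))
```

A PVC that references such a class would get its volume from the storage backend at that edge site rather than from the central site.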
And if you don't provide that, there are mechanisms in Kubernetes, such as topology-aware volume provisioning, that will try to figure out in which zone the volume has to be created.

The next slide is about the container image registry. When you deploy Kubernetes, at some point you will deploy the control plane and then your workloads on top of it. Many distros out there have their own container registry. We have one in OpenShift. I also put some names here: Harbor, Quay, and some other registries. The key point here is the same as what we said for Glance, Cinder, and Manila: you want the container images very close to the workloads at the edge. So if possible, we suggest having this registry at the edge, using the local storage. One of the things we do in our distro is use the CSI drivers to plug the image registry into the local Cinder storage, which is available in the availability zone that we want. Then you can scale the registry nodes, and they will use the storage that is available on site. I don't mention it on the slide, but there are also some options to replicate registries between sites, which can be helpful if you're doing a very large-scale deployment and you want to pre-populate the image registries. We don't do that right now. It's very complex, and you have many options available, but you can keep that in mind as well. Chris, over to you for the roadmap.

Yeah, thank you so much. So first, I want to say that, as you can see from the diagrams and the architectures we presented, this solution is ready today. What Maysa, Emilien, and the whole Red Hat team are trying to do is close the gap on some of the inconveniences that are there.
The whole idea from the customer perspective is to minimize the amount of network data that has to go over the core, or over the spines in your spine-leaf architecture, and also to minimize the number of services whose availability relies on a single AZ. So this is the focus. These are some of the features that will help us remedy that. Today, things like Manila, Octavia, or Designate, so load balancing as a service, DNS as a service, and some of the storage services, do rely on being in a single AZ. That can create a single point of failure if you don't design it properly. So we're addressing that down the road. Again, I mentioned the BGP feature that we are developing upstream and then merging down into our automation. This will allow the OpenStack control plane to live across L3, with each part in its own L2 domain, spread across L3. And there are many more features that help with caching or with providing the data at the edge. So these are some of the features that are coming into upstream OpenStack, and then also into the product.

We want to be respectful of everyone's time. This is the last slide. I just wanted to share some of the lessons we've learned over the course of implementing these edge architectures with customers. First of all, I do a lot of prototypes, as I mentioned at the very beginning. And I feel this is super critical in everyone's evaluation of this technology, of these solutions. The customers that I've seen be most successful usually gather, before I get engaged, teams from security, networking, storage, backup, et cetera, all of the teams in their company. And they put together a list of the use cases that they need to validate within this implementation.
For example, the platform needs to be able to secure a port, and you need a way to review the logs of that feature, et cetera. So if you're one of the people in the chair, trying to bring this to your company, build a lab, write that list, and set a target. Maybe 60% of all the use cases need to be validated before you can start rolling it into production. What we also found is that the latency recommended by Red Hat from the product perspective is 100 milliseconds round trip between the sites in the geographically distributed design. But that might not work for everyone. And you don't have to deploy it in production or in a live environment to figure out whether it does for you. What we do is simulate the environments in labs. You can inject latency in your spines, if you want, and see if your workload and the infrastructure still hold.

From the execution perspective, don't be a snowflake. This is generally true for any architecture. But consider the complexity of the lifecycle of both of these solutions. Kubernetes by itself can be complex. So is OpenStack. So you want to keep the variables to a minimum. Try to stay with what I would call a reference architecture. Don't be a hero. Don't try to implement some fancy features that maybe are not vetted. Just try to keep it as simple as possible. I can promise that down the road, your life is going to be much easier. And then, again, I mentioned that some of the services rely on a single availability zone today. What I see some of the successful customers doing is splitting those services into different SLA tiers. So for example, they have services that are gold, silver, and platinum. That's a good way of handling it.
If you want to use this service, you're not going to get the same SLA as you would for some other service like Nova, which is usually the most resilient. But that wraps it up. I think we have maybe a minute or two for questions, if there are any. Otherwise, thank you so much for coming. No questions. All right, thank you. Enjoy the rest of the summit.

All right. Can I get a show of hands? Who came just because the title sounded cool? Ah, all right. Pretty good for marketing. Welcome. My name is Tyler Stachecki, and I'm here today to share with you some of the discoveries that we've made at Bloomberg in scaling BCC, or Bloomberg Compute Cloud, our private cloud compute platform. We've made a couple of different architectural decisions than probably most other OpenStack clouds, and that's led to different kinds of scaling problems. So in the talk, we'll go over some of the higher-level decisions that we made, why we made them, some interesting edge cases, many war stories, core components of OpenStack that we've had to scale or make changes to, as well as what we face today in terms of challenges going forward.

So just a quick bit about me. As I mentioned, I'm a cloud infrastructure engineer at Bloomberg who helps architect our private cloud. We're currently running OpenStack Ussuri for all components, as well as Ceph Octopus, in our production clouds, both of which I helped to upgrade without any downtime for our users. And at least some of the content that we're about to go over is things that we needed to do and shore up to make those live upgrades successful, or things that were discovered during the upgrade process itself. So at Bloomberg, we use OpenStack to run some very large single-cell clouds. And I'll explain the why of the single-cell aspect in just a second.
But we have cells with over 100,000 virtual cores and about 1,000 computes, and they continue to grow quite rapidly. So by conventional standards, these are pretty large cells. Most people do multi-cell or something like that. We did hit some hurdles scaling past 10,000 VMs or so, and we'll dive into what those issues were and how we addressed them. We also run some pretty hefty Ceph clusters. As I mentioned just a bit ago, we currently only offer Ceph-based block storage in our clouds, though we are looking at alternatives, especially for highly synchronous and self-replicating applications like etcd, MySQL, and things of that nature.

So as I just mentioned, we do not use multiple cells within a single cloud, nor do we use regions. We evaluated both of them when we were deciding how to scale out the clouds, but ultimately decided against both, for two reasons. First, with regions, things like quota management can be a little confusing, especially if you've only ever used the default region. We've used OpenStack internally since Essex, and we've never had to formally introduce the concept of regions to our users. So getting our users to change their clients and workflows to make use of regions wasn't something we wanted to do unless we really had to. Second, multiple regions imply that you have a stretched Keystone or a stretched database and things like that. And we wanted a shared-nothing design. We don't want to have fault domains that span clouds. Clouds are distinct boundaries for us. It's the same thing with cells. We decided against them too, for a lot of reasons similar to regions. With cells, things like Neutron are documented to be global across all your cells. So that's another single point of failure. It's another single thing that you have to scale.
And while cells do offer some benefits in terms of allowing you to break up Nova a little bit, we don't really have any challenges scaling Nova or keeping it stable. So cells were not really a good choice for us either. It really boils down to this: we wanted a cloud that's simple to troubleshoot, to upgrade, and so on, for us and for our users. So we opted to run a small number of single-cell clouds that have a shared-nothing design between them, and to scale them up quite large vertically.

So in terms of designing for scale and the network architecture choices that we made: our foray into OpenStack, as I mentioned, started with Essex back in 2013. And much like other OpenStack clouds of that era, we were using Nova Network with a Layer 2 network underneath it. I'm going to oversimplify, but it was just two spines and a simple Layer 2 fabric. Over the years, there were changes to this approach. We didn't just keep the same thing, but the design was nevertheless Layer 2 at its core. And as the cloud grew and users started using it more and moving to it, we did have a few run-ins with this approach. The cloud simply grew bigger than the network design could really support. Lots of time was spent looking into network scaling issues along with the network team. Take load balancers, for example. Sometimes a user would decommission a bunch of their back-end servers without removing them from the front-end load balancer configuration. Whether they were using a software or hardware load balancer doesn't really matter. You would get a broadcast domain flooded with ARPs, because that load balancer is trying to reach those back ends and they're not responding. That's not to say that Layer 2 networking is broken or that we couldn't have done better. It's one of those things where hindsight is easy. You're scaling a cloud, and it's hard to keep tabs on all these challenges.
And so in 2018, we began rearchitecting for scale to solve a lot of these problems and to get us out of the business of operationally babysitting the clouds. What we chose was a Layer 3, BGP-based IP fabric. In this design, the hypervisors do not bridge. They route. What that means is that all the Layer 2 complexity just goes out the window. We don't have multicast. We don't have VRRP. We don't have large broadcast domains with ARP traffic to worry about. It's all gone. We only worry about things at Layer 3. At a high level, if you're not familiar with BGP, this is how it works: once you start up a VM, a route for it is announced to the hypervisor. The hypervisor then uses BGP to announce that route into the fabric. And then all the other network elements learn the best path to that route. We achieve redundancy and load balancing through equal-cost multi-path routing, which uses Layer 3 or Layer 4 hashing to determine which of the links that are up to route the traffic down. What's cool about this network design is that you can really go as narrow or as wide as you want. We actually have different clouds with different degrees of parallelism out in production right now. And what's even cooler is how capable this fabric is at scale and how well it works with things like Ceph. It's not just something we use for OpenStack. I'll go over some numbers to give you a taste. But really quickly, we recently pushed over a terabit per second of traffic across each network plane. So we have a ridiculous amount of bandwidth in each one of these clouds.

As for how this works from an OpenStack perspective, we don't use Open vSwitch, nor do we use ML2. We use the Calico networking driver from Project Calico. When a Neutron port is created, a tap device gets created on the hypervisor, and there's a route for that tap that, again, gets announced through BGP.
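The equal-cost multi-path idea mentioned above, hashing Layer 3 and Layer 4 fields to choose among the links that are up, can be sketched in a few lines. This is a toy model: real switches and kernels use their own hardware or kernel hash functions, and SHA-256, the `ecmp_pick` name, and the link names here are stand-ins chosen for illustration.

```python
import hashlib


def ecmp_pick(flow: tuple, links: list) -> str:
    """Pick one equal-cost link for a flow by hashing its 5-tuple.

    flow = (src_ip, dst_ip, protocol, src_port, dst_port).
    SHA-256 is only a stand-in for whatever hash the real device uses.
    """
    if not links:
        raise ValueError("no links are up")
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(links)
    return links[index]


if __name__ == "__main__":
    uplinks = ["uplink-a", "uplink-b"]  # hypothetical link names
    flow = ("10.0.1.5", "10.0.2.9", "tcp", 49152, 443)
    # The same flow always hashes to the same link, so packets stay in order.
    print(ecmp_pick(flow, uplinks))
```

Because the choice depends only on the flow's fields, different flows spread across the links while each individual flow sticks to one path.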
One other thing that's worth mentioning is that, in addition to serving as a router, each host also serves as a distributed firewall. Each of the nodes within our network is actually doing iptables enforcement. So we don't have a single choke point, with a pair of core routers or anything like that, and yet we can enforce a very rich networking policy precisely because it's so distributed. Calico does have some limitations which may make it impractical for some clouds. Again, it's similar to our reason for not using regions. Your users may be used to having L2 adjacency. We don't have that. The latency floor does tend to be a little bit higher than with OVS, things of that nature. But on the other hand, Calico can really simplify your deployment. One thing that we really like about it is that you generally have just a few provider networks, and users don't have to worry about subnets and routers and so on. We just let them create things in those networks without having to worry about any of the specific details.

And now the numbers, to give you an idea of how scalable this network fabric really is. In a single weekend in one of the clouds, we had some spare racks that had just been delivered. And what we decided to do was migrate every single VM in the cloud. That was to upgrade the kernels, the firmware, QEMU versions, things like that. And we were able to do it without an issue. The bandwidth was ample. We didn't cause any impact to the users' VMs. And then as recently as last week, we made some rack-level CRUSH map changes in our Ceph clusters. We moved a petabyte of objects in just a few hours, all while servicing tens of thousands of RBD volumes. During that time, Ceph sustained recovery speeds between 130 and 200 gigabytes, not bits, bytes, per second. So the network fabric really is what underpins the scalability of these clouds.
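As a quick sanity check on those figures, the arithmetic below estimates how long moving one petabyte takes at the quoted recovery rates. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and a constant rate, which is a simplification; the helper name is made up for this sketch.

```python
# Assumption: decimal units and a constant transfer rate.
PETABYTE = 10**15
GIGABYTE = 10**9


def hours_to_move(total_bytes: int, rate_bytes_per_second: int) -> float:
    """Time in hours to move total_bytes at a steady rate."""
    return total_bytes / rate_bytes_per_second / 3600


if __name__ == "__main__":
    for rate_gb in (130, 200):
        hours = hours_to_move(PETABYTE, rate_gb * GIGABYTE)
        terabits = rate_gb * 8 / 1000
        print(f"{rate_gb} GB/s ({terabits:.2f} Tb/s): about {hours:.1f} hours per petabyte")
```

At a steady 130 to 200 GB/s, one petabyte takes roughly 1.4 to 2.1 hours, which lines up with "a petabyte in just a few hours" and with the terabit-per-second figure quoted for each network plane.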
And again, it's Layer 3. So we don't have all those weird Layer 2 problems that creep up and sometimes present operational issues. That's not to say we didn't hit some hurdles when we first productionized these. The first issue was the BGP timers. We ran them quite aggressively to detect failures quickly. We had some issues with old versions of BIRD. At the time, distros were packaging major version 1. Now everyone's using major version 2, which fixes a lot of the issues we saw. We didn't consider the implications of strict route filters. It makes more sense to use loose reverse path filtering within the network, which is not Linux's default policy by design. And finally, we had some live migration run-ins with Calico. We fixed all of these. We've pushed PRs up to Calico. So if you've used Calico in the past and didn't have a good experience with this, give it another try. They should all be fixed upstream now. We also worked with Tigera on a follow-up PR.

So aside from the network architecture being great for OpenStack with the amount of bandwidth it can provide, there's another really neat feature. With Layer 3 networks, there's a trait called resilient hashing that we can leverage. I said we have ECMP to distribute traffic across both legs. The nice trait is that when we do that hashing, it's consistent. If you look at a TCP flow, the source port, destination port, source address, and destination address are all fixed. So whatever link that traffic is hashed down, it will remain on that link. You can keep TCP connections open consistently without having them bounce across different paths. And so you don't need things like a virtual IP anymore. In Layer 2 networks, we would use VRRP, with keepalived, to select which web server is the primary. Now we can just announce routes to multiple web servers, and VMs will route traffic to all of them.
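Here is a minimal sketch of the resilient hashing idea: flows hash into a fixed table of buckets, and when one next hop disappears only the buckets that pointed at it are reassigned, so flows on the surviving next hops do not move. Actual implementations differ by switch vendor and kernel; the class name, bucket count, and server names are assumptions made for this illustration, and SHA-256 stands in for the real hash.

```python
import hashlib


def flow_hash(flow: tuple) -> int:
    """Stable hash of a flow's 5-tuple (SHA-256 as a stand-in hash)."""
    key = "|".join(str(part) for part in flow).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


class ResilientGroup:
    """Fixed-size bucket table mapping flows to next hops (toy model)."""

    def __init__(self, next_hops: list, buckets: int = 64):
        # Fill the table round-robin so each next hop owns a share of buckets.
        self.table = [next_hops[i % len(next_hops)] for i in range(buckets)]

    def pick(self, flow: tuple) -> str:
        return self.table[flow_hash(flow) % len(self.table)]

    def remove(self, dead: str) -> None:
        """Reassign only the buckets owned by the dead next hop."""
        survivors = sorted({hop for hop in self.table if hop != dead})
        if not survivors:
            raise ValueError("no next hops left")
        replaced = 0
        for index, hop in enumerate(self.table):
            if hop == dead:
                self.table[index] = survivors[replaced % len(survivors)]
                replaced += 1


if __name__ == "__main__":
    group = ResilientGroup(["web-a", "web-b", "web-c"])  # hypothetical servers
    flows = [("10.0.1.5", "192.0.2.10", "tcp", port, 443) for port in range(2000, 2020)]
    before = {flow: group.pick(flow) for flow in flows}
    group.remove("web-b")
    moved = sum(1 for flow in flows if group.pick(flow) != before[flow])
    print(f"{moved} of {len(flows)} flows moved after web-b went away")
```

With a plain `hash % number_of_links` scheme, removing one next hop would reshuffle many unrelated flows; the fixed bucket table is what keeps established TCP connections on their original server.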
We don't have to worry about that. We can use BGP for those kinds of things. So that has interesting implications for things like Octavia, which we're looking at leveraging in the future. So as far as operationalizing the cloud and some of the issues we've run into, you have to have metrics. Like Bloomberg often says, if you can't measure it, you can't manage it. And it's core to everything we do. We monitor dozens of metrics from each service on the control plane and have a lot of alarm conditions on them that help drive our success. And so you have to ask questions like, do you know how long each VM is taking to spin up within your clouds? Do you watch capacity metrics? Are you ordering in advance? Things like this are really important in the post-COVID era. And metrics also tell us how users are using the clouds, which tenants are doing batch jobs, which ones are not. And so we take all of that and are able to leverage it to our advantage. And with that aside, let's look at some of the interesting things that we've found. So this one was instance spawn time. We started seeing the maximum VM spawn times creep up over a long period of time, especially when users were requesting larger batches of VMs. And we would see database pressure and the control plane go hot. And I'm embarrassed to admit it, but yes, at times, instances were taking several minutes to spawn. However, around the time it started getting really bad, we were already ahead of the issue. We already knew about it because we were watching the spawn times of the instances. And we were just around the corner from the fix. And what that was, was the Neutron API. So we looked at our logging systems and the API latency of things. And we pinned it down to Neutron and started looking at the network list. And again, we only have a few provider networks. But when you did a network list, Neutron would take 10, 20, 30 seconds to give you a result. So something wasn't right there.
And we started profiling the code and found out it's probably related to the Neutron RBAC evaluation. And once we fixed that, queries that were taking 10 seconds were finishing many times faster. So we moved that code to our staging cluster. And when we did, we got new alarms. But it was because we exhausted the IP space so fast. We ran out of IPs to spawn new ports. So please forgive me if you look at this patch. It is abandoned. I do intend to get it into the tree. I've been learning things myself this conference about Zuul and how to make that happen. So another teething one that's funny to look back at now, during the weekend upgrade of one of our Ceph production clusters when we were upgrading from, oops, sorry, wrong slide. This was the etcd3 gateway. So when we were upgrading from Rocky to Stein, we were doing a Python 2 to Python 3 transition. And we started seeing some functional test errors creep in every now and then. And what we were able to do is basically trace that to a problem in the etcd3 gateway where the watch semantics were broken in Python 3. So we pushed that issue up. If you do use Calico, make sure that you are using an etcd3 gateway version that has that fix. It will make it run a lot better. We also saw problems with Ceph during the upgrade when we went from Mimic to Octopus. They didn't show up on the staging clusters. They only really showed up on the really large production clusters. And fortunately, it was not an issue that presented a problem, per se. Users didn't notice it. We only noticed it because we were looking at the metrics. The tail latency was quite bad. And so when we started profiling the OSDs, what we found, and this was a Ceph release that had been stable for a year. It was 15.2.12. We still found this issue that late into the stable release. And it was the block allocator algorithm. So we were able to determine that it was a new algorithm that was used in Octopus that was not in Mimic. And we could back out to the one used in Mimic.
We didn't have to downgrade Ceph. We just bounced the OSDs. And since then, we've worked with Canonical to fix the underlying issue. And the fixes have been backported to Octopus. So if you're using Octopus, you may have benefited from this yourself. So this is an interesting one. One day, we saw alarms for CPU steal going high on an instance. And we correlated it to the processor clock being stuck under a gigahertz, which is weird, because we disable P-states and C-states. They should run at top frequency all the time. So we started charting the frequency of the slowest core into our time series database. And what we found is that some hypervisors, for four or five minutes at a time, would just drop down under a gigahertz. And we could go to the vendor, show them this data, and say something's not right with the platform. Before we alarm anyone, it is a rare issue. You don't necessarily have to go in and do this yourself. But it goes to show you what metrics will tell you about your infrastructure. You probably wouldn't catch the CPU dipping down for just a few minutes at a time if you weren't explicitly looking for it. And finally, this last one. So you might look at a graph like this and say, it's less than 1% network retransmissions. Why am I worried about this? Well, it was a new hypervisor build-out. We hadn't put it online yet, fortunately. We started looking at this. And we've seen isolated cases where we set up a switch for jumbo frames, 9K MTU. And even though the configuration's there, and you read it back and it says 9K, it functionally acts like 1,500. So if you look on the left, we set the boundary condition for the MSS just a bit over 1,500, for a 1,500 MTU. And you can see it's a couple of megabits a second. As soon as we drop it down so that we're limiting the frame size, full line rate. You probably wouldn't see this if you put it in production, until users stumbled on it.
But with metrics, we can see this before we get this in production. This is like nightmare fuel if you run a Ceph cluster and put a bunch of OSDs on a network that is not doing so hot. So for the next bit, we'll talk about a scaling issue we ran into that was unique to large Nova cells. And we basically saw cache stampedes. If you're not familiar with what this is, think of a web server farm that's fetching pages out of memcached, where normally they're already pre-rendered. Eventually, that memcached entry expires out, or you restart memcached because you're restarting your farm. Then you have a bunch of servers, all at the same time, that can no longer use that cache. And they're all rendering pages cold. And what you see is congestion collapse. The whole thing just caves in on itself. And it can't really catch up because it's just so busy trying to re-render these pages and things that have expired out. So we found that the Nova metadata API, with very large cells, can fall into this congestion collapse condition. And so here's what it looks like, at least for us. And one disclaimer, this may look different if you use Neutron metadata proxies or if you're using some of the newer features in Nova Stein. So in a steady state, this works well. VMs go through and mostly hit the cache. What doesn't hit the cache makes a few API calls and you're on your way. However, when everything starts hitting cache misses, like I said, then everything goes straight through. So you just have tons of VMs making back-to-back serial chains of calls all across your Nova and Neutron services. And you have VMs that vastly outnumber the control plane elements in most cases. And it can just overwhelm things, especially the database. So what we did is optimize that flow path. And I'll go over what exactly we did.
But the cold hit latency, the request latency to generate something out of the metadata, we dropped it by a factor of 10, which cut our database load in half. So things like this allow us to continue scaling out the cells. So going back to the original, here's the whole flow path. These queries can scale linearly with the instances in your cloud. The first thing we did is we asked, do you really need to look up the instance UUID? So one of the things that we can leverage, which probably doesn't work for a generic OpenStack cloud, is we can look at the routes to the taps on the hypervisors. And we can deduce which tap has which instance IP associated with it. And then we can go to libvirt and see which of those tap interfaces is related to which instance ID. And so we can basically distribute this database live across all of the hypervisors and just do local queries to determine how to map from that instance IP to a UUID. And so if we do that, we can just get rid of that call. We also use single-cell clouds. And there's a feature in Nova Stein which you may want to enable in your deployment if you can. It only applies in certain cases, but with this local metadata per cell, you can basically memoize the cell's UUID as well. You don't have to look it up for each instance. So we can get rid of that call. Another one is that the metadata has the names of security groups in it. We don't really use the security groups in production. We have an edge case use of Calico that we're happy with. So we were able to get rid of that call as well. And so this is our whole metadata path now. It's just a single call to the conductor. And when we did this, we saw RabbitMQ load go way down. Database load goes way down. Control plane load goes way down. Everything's a lot calmer. So one thing to take note of, if you're doing this: the question isn't just, do you have enough memory in memcached? Is it spread out far enough?
Because if you only have two or three memcached instances in your control plane and you restart one, you just threw out a third of your cache, and then you get the stampede effect. So the one thing that we did is to distribute a small memcached across all our hypervisors. And they all serve as small metadata caches. That way we can bounce them very slowly in a controlled fashion and we don't hit ourselves with this condition. And although I mentioned we optimized the path significantly, we're not really solving the problem of a stampede. So this is future work. But basically what we can do is, we know the period over which metadata TTLs out. And so we can have a thread that just basically pre-renders things in advance of them expiring, spread out over the whole window. And this way everything's always cached. We don't ever have a cold hit. And so this gives us basically total immunity to stampedes and predictable API latency. However, it increases the load when the control plane is otherwise idle. So that decision is up to you. I don't know if it's something upstream would accept. Scaling things vertically. So disaggregating your control plane. A lot of people new to OpenStack will start out like this. Like we did. People start using your cloud. It gets hot. And then you start doing some Ceph work, and maybe some other services on your control plane get hot. And it starts kind of blowing up. This is exactly what happened to us. So here are some CPU utilization figures over some time. We started out at 40%. And you can see we were bursting up towards 90. So, some quick things you can do. Disaggregating your control plane, just moving Rabbit off. If you have containers, this is probably trivial. But even if you don't have containers, you can just basically stretch your Rabbit cluster, point all of your configurations towards the new nodes you stretched to, and then just remove the old ones. The only thing to be careful of is to make sure you keep your queues synchronized when you do this.
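Going back to the metadata cache points above, the two ideas (spreading the cache over many nodes, and pre-rendering entries spread across the TTL window) come down to simple arithmetic. The sketch below is only an illustration under assumptions: the function names, the even key distribution across cache nodes, and the even spacing of refreshes are all made up for the example and are not taken from the actual implementation.

```python
def fraction_lost_on_restart(cache_nodes: int) -> float:
    """Share of cached entries that go cold when one node restarts,
    assuming keys are spread evenly across cache_nodes."""
    return 1 / cache_nodes


def refresh_offsets(keys, ttl_seconds: float) -> dict:
    """Schedule one pre-render per key, spaced evenly over the TTL
    window, so refresh load is flat instead of arriving all at once."""
    ordered = sorted(keys)
    if not ordered:
        return {}
    step = ttl_seconds / len(ordered)
    return {key: index * step for index, key in enumerate(ordered)}


if __name__ == "__main__":
    # Three control plane caches versus one small cache per hypervisor.
    print(fraction_lost_on_restart(3), fraction_lost_on_restart(1000))
    # Hypothetical instance IDs, refreshed across a 60 second TTL.
    print(refresh_offsets(["vm-a", "vm-b", "vm-c", "vm-d"], 60))
```

With three cache nodes a single restart turns a third of requests into cold hits, while with a cache on each of a thousand hypervisors it is a tenth of a percent.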
So really quick, in a bind, we just moved Rabbit off to some other nodes and farmed out our control plane a little bit, and we were good to go. So as I said, you might say, I have containers, I'm not subject to these kinds of things. This was a relatively easy case. So let's look at the Ceph mon and Ceph manager, another component you might have. Same thing, right? You just stretch out those Ceph head components across the cluster and decom the old ones. It's easy, right? Well, no. If you do this, all live migrations and reboots will break. Why? It's because Cinder, when you attach a volume, stores attachment information along with it. It stores the IPs of those Ceph heads. It doesn't store the host names. So if you move the IPs, it's going to try and build a connection with IPs that are no longer hosting Ceph services. And when you do this, elements of your control plane will hang. We've also run into issues where we tried to disable trim, and we found that the trim flag is also embedded into the attachment information. So it's not very easy to disable. There are things you can do to fix this, but it's something you should be aware of, that Cinder keeps this information. It's not always as easy as just moving things around to different IPs. So as far as disaggregating things, how we do this is basically, we build a new set of Ceph heads, to the right. Pretend this is a VM over at the left with our current three heads. The first thing we do, if we want to go from three to five, is we decom just one of them and we build the three new ones. So we still have an odd number of Ceph heads. Things are a little all over the place, but what this looks like is this. So now we can migrate the VM, and we can use a custom Nova patch to basically say, point to these ones. Disregard what's in the attachment information.
And then from that point on, it will use those ones, as long as you update the information in the database as well, and you can live migrate the instances. And then you can go decom the old ones later. And now you're safe. You can live migrate them, you don't have to worry about it. If you're paying attention though, there will be some point in time where you spin up instances and you'll have them pointing to two heads from the old set and three from the new set. And all you have to do is basically repeat the process once more for any instance that got created in that window of time in which you were manually specifying the Ceph IPs that you're moving to. And then afterwards, all instances will be moved over to the new Ceph heads. So we've done this in our staging cloud. It is a process which we've vetted and tested and it works, but just make sure you're not changing around Ceph IPs or you will get yourself into a lot of trouble. Unsolved problems. So here are a couple of the things that we're currently grappling with, one being cloud federation, so we'll start with that one. As I mentioned, if you use regions, this is kind of a solved problem, but we don't want the single point of failure that comes with Keystone or the database being stretched out. So one of the common asks we have from consumers is, I still want a way to manage many isolated clouds. There might be a tenant that has the same name across all those clouds. I want to assign a quota to all of them. We have client code with which you can go do that, but having something, somewhere, that supported a model like this would certainly make our lives easier. I don't know if anyone else does anything like this, though, or if it's a problem unique to us. Instance state becoming incoherent.
Actually, we talked about this in the Nova SIG yesterday, so this might be a solved problem, but basically if you try to live migrate an instance and the source hypervisor crashes, it might leave duplicate volume attachments there and things of that nature that have to be cleaned up, and right now we haven't found a way to automatically heal from that condition. However, there may have been some patches recently that we'll have to test to see if this still occurs. But this is one thing, when you're running thousands and thousands of hypervisors and you're doing live migrations fairly frequently, one or two of them break, and this becomes noticeable because you have to go address it yourself. So, strange race conditions in APIs. Some users will sometimes send two actions in quick succession. They might try and attach a volume twice in quick succession, things like this. This may have been improved in Ussuri or newer releases than we're using, but sometimes we'll have users where it feels like they're stress testing the API in a way. Regardless, when they do, ports, volume attachments, things like that go into error state, at which point we have to go reset them to active and make sure that they're in an okay state. It's just one of the things that we see and haven't exactly figured out how to deal with yet. And then one of the last problems is taming the Nova scheduler. So we have a heterogeneous set of hypervisors, with different amounts of RAM and CPU. They're different generations of machines. We find that Nova doesn't always do the best job of balancing things out to our liking. There are certainly tunables you can use to favor memory or CPU usage or things like that. So you do have to kind of play around with those, especially at scale, to make things work. The other thing that we've done that helped us a lot is to maximize for anti-affinity across availability zones.
So if you use server groups and you specify that you want an anti-affinity policy, it'll schedule across different hosts, but it might put a lot of them in the same AZ, which again, we're trying to avoid single points of failure. So we have a scheduler filter that explicitly schedules across availability zones. And that way we can guarantee that the actual instances are safe from an outage. And that's it, wrapping up with just 30 minutes. We're over in the Bloomberg booth. If you have any questions, or if you want to download our repository and play around with a Vagrantfile, you can actually go and create this leaf-spine network topology. Download our code, and it will set up a whole Calico OpenStack cloud for you based on everything that we do. And last but not least, we're hiring. So I don't know that I have time for questions, maybe like one quick one. Sorry, I can't really see the lights, right? Sure, so the question was, we're running all this stuff on one Ceph cluster per cloud. Do we have any disaster recovery plan? The answer is yes. So the one thing to remember, I think it was in the Mirantis talk yesterday, they said backups are different from replication. So Ceph is really good at replication. It's not a backup system unless you have multiple ones and you're actually moving the data off periodically. But we're very careful and diligent about how we do testing. We have a staging cluster that we test releases on for a considerable amount of time before we push them out. We haven't really run into any issues, so to speak, as far as disaster recovery. We also have multiple clouds, right? We encourage users to treat clouds as availability zones. If you're putting everything in one cloud, it's not a good idea. But yeah, I guess I'll end it here. Please come to our booth if you have any questions or you want to chat a little more. I just want to get off the stage because I don't want to hold up other presenters. Thank you.
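The availability-zone anti-affinity idea mentioned in the scheduler discussion can be sketched in plain Python. This is not Nova's filter API and not the filter from the talk; the function name, the host names, and the tuple layout are hypothetical, and the only assumption is that each candidate host is known together with its AZ.

```python
from collections import Counter


def pick_host(candidates, group_members):
    """Pick a host for the next member of an anti-affinity group.

    candidates and group_members are lists of (host, az) tuples.
    Hosts already used by the group are excluded; among the rest, the
    host in the AZ holding the fewest group members wins. Ties are
    broken by host name so the result is deterministic.
    """
    used_hosts = {host for host, _ in group_members}
    az_load = Counter(az for _, az in group_members)
    eligible = [(host, az) for host, az in candidates if host not in used_hosts]
    if not eligible:
        return None
    return min(eligible, key=lambda item: (az_load[item[1]], item[0]))


if __name__ == "__main__":
    hosts = [("hv1", "az1"), ("hv2", "az1"), ("hv3", "az2")]
    # One member already sits in az1, so the next should land in az2.
    print(pick_host(hosts, [("hv1", "az1")]))
```

Host-level anti-affinity alone would accept hv2 here; counting members per AZ is what pushes the second instance out of az1.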