Hello, everyone, and welcome to OpenInfra Live. For today's episode, the Large Scale OpenStack show is back. Large Scale OpenStack is a show for OpenStack operators, by OpenStack operators. Organized by the Large Scale SIG, we feature open discussions with operators and guests around how to run large scale OpenStack deployments. Today's episode will be a deep dive into the Yahoo deployment. Yahoo was previously known as Verizon Media and, before that, Oath, and they have run a massive OpenStack private cloud for quite some time. They were the first to reach the million-CPU-core scale, so I'm pretty sure they have great stories to share. And so without further ado, let me introduce our cast for today. Our guests from Yahoo will be Brandon Conlon and Adam Harwell. And to discuss with them, our hosts today will be Mohammed Naser from VEXXHOST, Arnaud Morin from OVHcloud, and Belmiro Moreira from CERN, who will be moderating the discussion. So Belmiro, take it away. Thank you, Thierry. Hello, everyone. I'm really excited about today's episode and about having Adam and Brandon here to discuss the large scale Yahoo compute infrastructure. But first, let us know more about you. How did you get started with OpenStack, actually? Adam, do you want to start? Sure, thanks. Yeah, I actually started with OpenStack way back in 2014 when I was at Rackspace. So I've been through a couple of organizations since then. I started off as a founder of the Octavia load balancing project. I was doing load balancing at the time at Rackspace and we spun that up, and I got sort of introduced to the upstream community and have had a ton of fun with it since then. Nice. How about you, Brandon? Yeah, so I'm a little different. About 10 years ago was my first introduction to OpenStack, and that was actually as a user in the Yahoo search team. At that point, OpenStack infrastructure in the company was being spun up; it was an initial implementation. So I was a user for a while, and then three years ago there was an opportunity to set up an operations team in Taiwan. I'm actually based in Taiwan. So I took the opportunity to become a technical manager, doing operational work and setting up a team in Taiwan to help have more global coverage within the global OpenStack team at Yahoo. It's been a fun journey. Nice. So I think now the big question, the question that everyone wants to hear about, is your infrastructure. It was announced that Yahoo now has more than one million cores on OpenStack. So what does your infrastructure look like? Yeah, so we have a large infrastructure footprint, obviously. We have multiple geolocations around the globe to support the Yahoo services, and within those we have multiple different clusters as well. So we have clusters supporting personal virtualization, where we give all the developers personal VM boxes to use instead of having desktops. Previously in the company everybody would have a desktop. When I joined, you were given a PC that you put on your desk and it was your pet. You kept it, you cared for it, you looked after it, and it was special, and everybody treated the computer that way. Now, when you join the company, you actually have the opportunity to spin up virtual machines in different geolocations for free and just use them as your development systems. So we have that, we have bare metal clusters, and we have different virtual machine clusters.
And then we actually also have a new environment that we're spinning up as well that is really combining the personal and production virtual machines in the same clusters, using newer releases of OpenStack and newer features. So roughly 40, 50-ish clusters around the world. So by cluster you mean different regions? Yeah, different regions, different OpenStack installs. They're completely separate systems, separate networks. All of them are completely siloed. No common Keystone, no common components, right? Yeah, we have some things, I believe, Adam, you can correct me if I'm wrong, but some of the Keystone bits are common. For the most part, though, they're all actually separate installs. So our clouds.yaml file is extensive, and we actually give our developers an RPM or a git download location for that. Right, but your decision to have these completely isolated regions, is that a business decision, or is it because of the scalability of OpenStack? What was the motivation for this architecture? A lot of it is business decisions and the architecture of the company. With the networking, sometimes it just simplifies things for us in a way. It makes things more complicated, but it also simplifies a lot of things. Yeah, I understand that, because I was imagining one million cores in one OpenStack instance, and that would be pretty crazy. Yeah, you can imagine that any required update that required downtime would be rather scary. How do you deal with connectivity across those different sites? Like, do you have some sort of shared network that's available across all these sites, where the user can provision a VM here, a VM there, and they'll just have shared connectivity out of the box, or is that kind of a user problem? Yeah, I think that's a pretty standard, you know, large internet company way to do things. We have our own backbones, obviously, with our own interconnects, so the actual co-locations are interconnected. But from a user point of view, it's pretty easy to just spin up a VM here, spin up a VM there, and have them talk to each other, not over the internet. It's relatively straightforward. We're moving forward in the new clusters to a zero trust model where the networks are a lot more open and we can allow teams to manage their own security groups, and that's tied in with the Yahoo company network and controls, you know, the ACL controls. Yeah, I think it's also interesting, when you bring up something like that: when you have all these different keystones, and you have such a big clouds.yaml file, are you using any federated authentication or anything like that so that you're not maintaining a whole bunch of users? Are you just backed by LDAP? Can you talk about that? Do you want me to take that one? I can kind of talk about that. So yeah, Keystone is backed by an authentication system we use called Athenz, which is an x509-based mutual-auth system. So we actually do our user auth in one centralized place in Athenz, and then all of the keystones globally just communicate back into that system. So if you think of all of those keystones, it seems kind of crazy, right? But they don't all have their own auth DB or anything. That's all sort of more centralized.
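As an aside on the "extensive clouds.yaml" Brandon mentioned: a minimal sketch of what a multi-cloud clouds.yaml of that shape might look like. The cloud names, endpoints, and project are invented for illustration, and Yahoo's real file would plug in their Athenz x509 auth rather than passwords.

```yaml
# Hypothetical excerpt of a multi-cloud clouds.yaml distributed to developers.
clouds:
  gq1-vms:
    auth:
      auth_url: https://keystone.gq1.cloud.example.com:5000/v3
      project_name: my-team
    region_name: RegionOne
    identity_api_version: 3
  bf1-baremetal:
    auth:
      auth_url: https://keystone.bf1.cloud.example.com:5000/v3
      project_name: my-team
    region_name: RegionOne
    identity_api_version: 3
```

With a file like this, a developer targets any of the separate installs with something like `openstack --os-cloud gq1-vms server list`.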
So Keystone essentially is just generating tokens for OpenStack, but the authentication itself is taken care of by something else, and Keystone is just giving out a token for that cloud. Yeah, Keystone basically handles the token, the OpenStack token, but all of the actual auth model stuff does just get passed through to another system. And I believe that is open source as well. Yeah, Athenz is open source. I don't know if the OpenStacky bits are, but... Well, how you use Keystone is interesting, but I have a similar question about Glance, because if you have all these different clouds, how do you deal with images? Do you replicate them everywhere, or is there a centralized store that every Glance connects to? Yeah, we use automation to replicate them across all the different clusters. The way we treat a cluster, as well, is as a standalone, because it's in a different co-location, and the teams that use our system have their own business continuity planning. They'll plan to have multiple locations so that if one data center goes completely down, they can still manage to run their properties on the Internet. So we treat everything as very segregated and siloed, and that's one of the reasons why we do that. So yeah, we just have automation jobs that push. We have one place where we can go and trigger image deployments, and it will deploy across all the clusters. All right. But that is something that you developed, right? Because in the OpenStack world, there is nothing that can trigger that replication everywhere. Yeah, here's another plug for some more open source Yahoo technology: Screwdriver. I don't know if any of you are familiar with Screwdriver. Screwdriver is a CI/CD automation system that was developed internally at Yahoo due to scalability issues we had using Jenkins. Screwdriver is a very scalable way to do deployments over hundreds of thousands of systems, and it gives us the ability to do some very nice configuration within it. We can actually have our own custom Screwdriver images that we define and create, with templates and steps that can be specific to OpenStack. We specify our own templates that have steps, and it's very, very configurable and reusable across all the clusters. So we can quite quickly spin up automation that does a lot of these things for us. And I guess while we're on the topic of many clouds and how you deal with that, what about dashboards? Do you have your own internal dashboard that replaces Horizon, or do you just have many Horizon endpoints and go to the right one when you want to use a web interface? We have two different Horizon versions that we run, with multiple options to pick the right cloud. Yeah, I'm trying to remember how custom that is. I know we wrote something around that to make those regions show up properly, but I don't think it was a major change. So there's still kind of one main Horizon at the end of the day, which makes life a little bit easier when it comes to that. Yeah, although I believe most of our users end up using the APIs or the command line. But having only one Horizon, is that not challenging, since you have multiple keystones? Where does it connect to? That would be interesting. Yeah, so when you go to our Horizon, you get sort of a dropdown choice about which region to connect to, and it picks the keystone at that point. Okay.
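That pre-login region dropdown is a stock Horizon mechanism. A sketch of what it looks like in Horizon's local_settings.py; AVAILABLE_REGIONS is a standard Horizon setting, though the endpoints here are invented and Yahoo's exact customization may differ.

```python
# local_settings.py: one Horizon fronting several independent keystones.
# Each entry is (keystone endpoint, display name shown in the dropdown).
AVAILABLE_REGIONS = [
    ('https://keystone.gq1.cloud.example.com:5000/v3', 'gq1-vms'),
    ('https://keystone.bf1.cloud.example.com:5000/v3', 'bf1-baremetal'),
]
# Default keystone used when nothing is selected yet.
OPENSTACK_KEYSTONE_URL = AVAILABLE_REGIONS[0][0]
```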
So what I'm thinking you're mentioning, Adam, is that you're not changing the region dropdown once you log in, the one inside; you have another one before you log in. Actually, you can change it inside too. Again, this is possible, I think, because of the way the auth works: it's all centralized auth, right? So you're still using the same x509 identity; it's just that Keystone needs to issue a new token when you swap. Right. Okay, cool. So you have all these different regions, these different clusters. How do you manage them? Do you have some still on Diablo and others already on Wallaby? How does all this work, having so many clouds? Yeah, we're not that far behind yet. I think we have some still on Ocata. Yeah, it depends. So yeah, there's a lot of stuff. I remember my first day at Yahoo, coming in and having my manager whiteboard it, like, okay, so we've got this cloud, this cloud, this cloud, this cloud, this cloud; these are in these regions. And it's definitely fun to manage. But for the most part, I think we have two major setups right now. One is Ocata, and that's currently in the process of being upgraded to Wallaby, and in some cases a little beyond that. And then we have another cluster that's mostly running master, with a couple of exceptions. Master. Yes. Nice. So you're upgrading it as a rolling upgrade? Yeah. And this is where it's a very interesting point. When we say master, it is, in a sense, a point in time. So when we need a specific upgrade or a specific package to pull down, or when we're looking to do an update on a service, we will trigger a new build, or one will be triggered for us. Most of our stuff is just: commit comes in, build, test, deploy. Not really hands-on so much. But we can also say we want any version of this to go out and trigger a build of that. So some of the stuff that I say is on master is probably as far back as Xena in some cases, but some of it's a lot more recent as well. It's kind of a mixed deployment. So even if you have, I don't know, a database schema change which is coming on master, how do you apply it? Is it fully automated, since you said you have some sort of CI/CD? Yeah, all of that. So when we get a build and we make a new version of whatever service that is, part of our automation is applying all the database stuff. We do have backups, but we haven't ever had to use them (cross fingers, knock on wood) yet. It's been working fairly well so far. One caveat on that is that we have to be very careful with our downstream patches. We do have a few downstream patches, not so many on that specific cloud, but we have to be careful never to do our own schema changes, right? Because that would break everything there. The idea is to stay as upstream as possible. If we need to do something, we get it upstream. And how big is this master region? Because that sounds amazing. That is our newest cloud deployment. I don't know exactly how many... Thousands of hypervisors. Nice. So it's quite big. Yeah, it's in six geolocations. And you're not using any custom mechanism behind Neutron, for example, or any other component? It's full upstream, full master? Like I said, we have a few downstream patches, but they apply automatically as part of the build process. And we try to... There's an art, which some of you may have experienced, to making downstream patches that don't conflict. So we try our best to do things in that way, and if we can get anything upstreamed, we definitely try to do that first.
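To make the commit-build-test-deploy flow with automated schema migrations concrete, here is a minimal sketch in Screwdriver's pipeline format (https://screwdriver.cd). The job names, build image, and shell scripts are hypothetical, not Yahoo's actual templates; the nova-manage commands are Nova's real DB migration tooling.

```yaml
# screwdriver.yaml: hypothetical commit -> build -> deploy pipeline.
shared:
  image: example/openstack-build:latest
jobs:
  build:
    requires: [~commit]          # triggered on every merged commit
    steps:
      - package: ./build-rpm.sh nova
  deploy:
    requires: [build]
    steps:
      - install: ./push-and-install-rpm.sh nova
      # Schema migrations are applied as part of the same automated rollout.
      - dbsync: nova-manage api_db sync && nova-manage db sync
      - verify: ./smoke-test.sh
```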
Most of our patches are upstreamed and then pulled downstream until they merge. And what about that deployment? Is it using any orchestration? Is it containers? Is it just packages and Puppet? That deployment is still using RPMs that we build, which are then deployed via Chef, although the newer stuff that we're working on is containerized and deployed via Ansible. So the lifecycle of the infrastructure is handled by Chef, basically pushing RPM packages somewhere and just doing an RPM install, right? Yeah, basically a packaged virtual environment, for that particular one. Again, the number of roles that we have is fairly large, and some of them are deployed in slightly different ways. For the most part, though, it's all Chef right now. I wish OpenStack was just pushing a bunch of RPMs and packaging and calling it a day. So this master cluster is pretty interesting and quite challenging. But I think the big challenge is the old ones, the ones that you said you still have on Ocata. To move those forward to more recent versions, what's your workflow? What are your plans? So there's a project that's been ongoing for a while. I think we just call it "upgrade from Ocata", because the end goal is a little bit nebulous. I think when we started, the goal was Wallaby, because that was the latest release. There's more now; I think that gives you an indication of how much time this has been taking as well. Right, yes. But in this undertaking, a lot of the work was actually more around moving from this RPM-based thing that's on Chef to a totally containerized deployment model that's based on Ansible. So not all of it was actually around OpenStack itself; there were a lot of other bits and pieces, moving parts. The plan is basically a large fast-forward upgrade through however many versions that is. And again, not all the components are going to land on exactly the same version. This is one of my little pet things: OpenStack really doesn't need to be run on a single version. It's very simple, in a lot of cases, to run older and newer services together. I remember when I started working on Octavia, we would run Octavia master, which at the time was, I don't know, Queens, on a Liberty cloud, and that worked perfectly fine. It's all about knowing which services each service interacts with as part of operation. So for the most part, you can upgrade stuff not in lockstep. Yeah. From Ocata it's a huge step, right? Because you're also going to jump across the different Python versions that are supported by OpenStack. Yeah. We did another project to get off of Python 2 a while back. And actually, I can't remember whether all of the Octavia stuff made it to Python 3 or not; I think some of it may not have. I think there was only one component. Yeah. Definitely a challenge. That raises kind of another interesting question, which I guess is probably a big reason, or part of the reason, why you're moving to containers: what about your underlying operating system? I mean, depending on what release you're running or what flavor of distro you're running, that's quite a bit of operating system upgrades too. So is that why you're containerizing, or what are your thoughts on that? Yeah, that's one of the reasons. One of the reasons is, you know, we're trying to do regular OS upgrades. We want to keep up to date with security patching specifically, so we want to have a very regular cycle where we're doing that.
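For readers unfamiliar with the fast-forward upgrade Adam described, the control-plane side is roughly: walk the database through each intermediate release in order, then restart services only on the target release. A rough sketch, where the release list and the install helper are assumptions and the nova-manage commands are real:

```bash
# Hypothetical fast-forward pass from Ocata toward Wallaby (Nova shown).
for release in pike queens rocky stein train ussuri victoria wallaby; do
    install_openstack_release "$release"   # hypothetical helper: swap RPMs or containers
    nova-manage api_db sync                # apply API DB schema migrations
    nova-manage db sync                    # apply cell DB schema migrations
    nova-manage db online_data_migrations  # finish data migrations for this release
done
# Only now restart the running services on the final release.
systemctl restart openstack-nova-api openstack-nova-conductor openstack-nova-scheduler
```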
So that's one of the reasons we're containerizing. Another reason is really, you know, orchestration and roll-outs, and also rollbacks. Chef is, you know, an okay orchestration tool, I think. People who have had a lot of experience with it know the benefits, but there are also drawbacks. So I think moving towards containerization is definitely a step we want to take in the future, and then we can actually look more into how we orchestrate that container deployment across all the clusters to try and automate things. With a very small team, we need to leverage as much automation as possible to free up time to focus on other issues. And when you need to upgrade a hypervisor, how do you manage to move the load off of it? Are you not moving it, are you shutting down the instances, or are you moving the instances? Yeah, at this point in time, we don't have live migrate or anything like that. So, yeah, it's very problematic for the users. Luckily, a lot of teams have some good automation involved, so they can handle it quite smoothly without any issues; they can take their downtime. But it is still very problematic. There's a lot of scheduling there, there are a lot of issues, and it's definitely something we're looking at changing going forward. So you're not doing live migration at all, right? Okay. It's kind of nice. Yeah, it's nice. I don't know, we would actually quite like it. Not for the users, but just as an operator, it's like: I don't have to worry about that. Shut it down, restart it, have a good day. It still means you need to at least do the computes one by one, not in bulk, in order to avoid shutting down all of a customer's instances at once. Yes, we have to, and that can be tricky. The scheduling, understanding what's where, with so many teams... I mean, we have hundreds of teams with, you know, thousands of projects, all running across the infrastructure. So that whole management piece can be very tricky and time consuming. Yeah, and is it something you do manually, or do you have any robot or tool to implement that? We have some tooling to help us, but I'd say it's manual, really; we don't have an end-to-end automated process for that, unfortunately. I wish we did. If you know of any, please help us, let us know. Right. I think we have a question from the audience. Yeah, we actually have two questions from the audience, and the first one is from Andrea Francescini, asking: how big are those regions? Can you give us some indication of the size in terms of number of compute nodes? Yes, probably in the region of, you know, a single region could have 100,000 or so across the different systems, I think. And the other question is from a very regular viewer, Sung Soo Cho from Korea, asking: what type of ML2 driver do you use? Which is a much more precise question. I don't actually know that off the top of my head. Adam, do you know? So I am, self-professedly, very much not a networking person, but I can try to find the answer to that question really fast and possibly get back to it. Thank you all. So, can we go back to the live migration question? Why no live migration? What is the motivation for not having it? We are motivated to have it, but we have some issues: the way the company networking is set up makes it tricky for us to be able to utilise live migration.
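For context on the one-node-at-a-time approach: without live migration, taking a hypervisor out for maintenance is roughly "stop scheduling to it, coordinate downtime with the owning teams, then re-enable it". A sketch using standard commands; the hostname and reason string are placeholders.

```bash
# Stop Nova from scheduling new instances onto the node.
openstack compute service set --disable \
    --disable-reason "OS upgrade $(date +%F)" compute-0042.example.com nova-compute

# ... owning teams fail over or power off their instances; node is patched/rebuilt ...

# Put the node back into the scheduling pool.
openstack compute service set --enable compute-0042.example.com nova-compute
```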
I think that's probably the best way to explain the live migration situation. We would like to do it, and we're still investigating to see if we can find a solution that would allow us to implement it, because it would definitely be a benefit for our users if we could have it in place. Yeah, I think the major blocker for that, if I remember correctly, Brandon, was the hypervisor-to-hypervisor communication, because with the way we do our security model, we just can't allow them to communicate in the way that OpenStack expects. It's quite open, actually. Yeah, and we take security very seriously here. I mean, I'm not going to bring up history, but Yahoo has a history with security, and we do not allow that kind of openness in our infrastructure anymore. So that's been the major blocker. But you still provide layer 2 networking between computes for the instances? I mean, instances can communicate on a private network, right? Yeah, I mean, they have their own networking layer that they can communicate with each other on. But the security model for the upstream live migrate typically has hypervisors communicating directly with each other, basically as root users, essentially SSHing into each other. Yeah, and we don't love how that works. Yeah, I think the SSH stuff should probably be pulled out eventually at some point, because especially if you have shared block storage, there's just no point; it just needs to talk to libvirt. But speaking of that, what's the story of block storage at Yahoo? Is it Ceph? Is it Cinder with some [insert vendor here] used as the block storage backend? Is it local storage only? So yeah, we're using Cinder with a vendor that we've chosen to support us on our infrastructure. We have block storage available on the newer clouds that we've put in place. We're also leveraging some Manila as well in certain situations. It's definitely useful, especially for the development systems. It makes it easier for us to tear things down and not affect teams. If somebody knows we're going to be regularly doing rebuilds on hypervisors in specific environments, they can leverage either block storage or file storage to make that less impactful to them. Right, so... Yeah, go ahead, Adam. No, it's... Okay. So I'm still thinking about what Brandon said when answering that question from the audience: some regions, some clusters, can have hundreds of thousands of resources. So is it time to start the traditional RabbitMQ segment of the show? What are the big challenges when scaling to those numbers? Because we always discuss Neutron scalability, RabbitMQ... What are your findings when running at those numbers? I think very similar. For everyone, it's probably RabbitMQ; that's probably the biggest issue that we've seen running at those numbers, just sometimes the stability of RabbitMQ. So we've actually spent a lot of work recently looking at changing the way we run RabbitMQ to try and help that. Yeah, we've got a couple of recent changes. The last one that was rolled out across the board was switching everything to use individual vhosts per service, which seems to have possibly helped with that, at least allowing us, if we need to reset a vhost, to do that and not take every service down. Do you have one big RabbitMQ cluster for the entire region? Basically, I think the rabbit is still per cloud, essentially. So it's not one per region exactly.
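The vhost-per-service layout Adam describes maps to the standard oslo.messaging transport_url: each service points at its own RabbitMQ vhost (the path suffix), so one vhost can be reset without touching the others. A minimal sketch; hosts and credentials are placeholders.

```ini
# /etc/nova/nova.conf
[DEFAULT]
transport_url = rabbit://nova:secret@rabbit1:5672,nova:secret@rabbit2:5672,nova:secret@rabbit3:5672/nova

# /etc/neutron/neutron.conf
[DEFAULT]
transport_url = rabbit://neutron:secret@rabbit1:5672,neutron:secret@rabbit2:5672,neutron:secret@rabbit3:5672/neutron
```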
Right, so there are a number of Rabbit clusters per region. You make a difference between the cloud and the region? Yeah, so this is one of my pet peeves, but there's not a really good way to solve the ambiguity of these terms. Region is a little bit overloaded, right? There's the Keystone region, which we actually don't use. When we say region, what we mean is geographical region: a data center is a region. And within the data center region we have all of these different clouds, which you can think of as basically one per Keystone. AZ is another term in there, because then you're talking about Nova and Neutron AZs, right? Which is different, and we have those, but that's within the geographical region, being the DC. So there's a Keystone instance fronting a cloud, and within that there might be AZs, but the AZs would all be within the same DC. Right. So basically you have a Keystone, you have a RabbitMQ cluster... Yes, basically a Keystone and a rabbit, and then a set of other service APIs. Okay. So it still sounds like one Rabbit cluster for a whole cloud, a big cluster serving a lot of the computes. You said you're not separating, for example, Nova and Neutron on Rabbit; Nova and Neutron are communicating with the same Rabbit instance? Yeah. And, like I said, we switched to at least not using one vhost for everything within Rabbit; we've got it to be per service. And are you spawning a cluster of Rabbit, or is it only one server? It's a cluster. I think we use three Rabbit nodes. And the newest thing that we're trying is quorum queues for Rabbit. That support just recently got added to Oslo, and we're trying to deploy that presently. I'm trying to look up the thread, but I think it's an individual who works at OVH who has done an extensive amount of research into deciphering this issue where, if a node of Rabbit goes down, the whole cluster just stops responding to things randomly. He's come up with a reproducer as well, and documented the ways that Rabbit nodes can fail. For us, I don't know, we're just going to move away: we're going to let Kubernetes handle it. We're just going to have one node, and if it fails, Kubernetes will reschedule it onto another node, and we'll have health checks. You'd think, cool, that's why we have the two other nodes, but that's not how it works. So the way that we are doing it at CERN is we don't have RabbitMQ clusters, mostly. We still have a few, but mostly it's just one node. And if it goes down, it's easier to recover that failed node than, yeah, to recover a cluster. That's the way we operate. I think the move is to have one rabbit per service. If you have the ability, if you have Kubernetes and you can easily orchestrate them, have one rabbit per service and just scale up the resources on it if it needs more, and be done with it. That, I think, is my personal ideal setup as well, and I've been pushing for it, actually. We haven't gotten a lot of traction around that, just because of the extra resources. When you think about it, it makes sense, right? But then when you've got, what did we say, 40 clouds or something, right?
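The quorum queue support Adam mentions landed as a configuration option in oslo.messaging; a minimal sketch of the relevant per-service setting (the same section goes in each service's config file):

```ini
[oslo_messaging_rabbit]
# Use RabbitMQ quorum queues (Raft-replicated, durable) instead of the
# classic mirrored queues for this service's RPC traffic.
rabbit_quorum_queue = true
```

Quorum queues trade some throughput for much more predictable behavior when a cluster node fails, which is exactly the failure mode being discussed here.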
And then you think about trying to orchestrate that many more rabbits, where each of these is running on... I think we use bare metal for the rabbit machines. And internally here there would be ACLs and whatever. It's possible, but it's a huge headache, right? Yeah, run them on VMs; we can do that. Yeah, I wish. At a previous incarnation of a previous job, I guess, we did that, and I ran rabbit-per-service in Kubernetes. It was great. But it's a lot harder in this infrastructure. I would love to move there, really would. Or to something completely different. I have a friend who's working on, I think, an alternative to rabbit for their setup at, or sorry, not Fujitsu, a different company... It would be LINE. We've definitely looked at some other stuff as well. Yeah, I think it was LINE. Yeah, he's at LINE. Yeah. And so, maybe to close on the RabbitMQ section of the show, which is always a highlight, we have another question from Sung Soo Cho: beyond those RabbitMQ pain points in scaling, what are your main pain points in OpenStack operations? God, I think some of it is around the automation and deployments and upgrades. I think we've touched on some of that already. That's one of the reasons for the upgrade-from-Ocata project, where we're trying to add the containerization. Moving forward with that then gives us the ability to look at possibly using Kubernetes and benefiting from that as well. And then I think the other main issue is the regular OS upgrades. How do you do that in a manner that is the least impactful to the business, but also the easiest for the support teams to manage? Those are really the key areas. I know as well there's a desire for the future where, as I think I've mentioned, we're consolidating some clusters: with the new clusters we're putting in place, we're going to combine production virtualization with the personal virtualization clusters and retire old clusters. And maybe in the future, have the ability to have large clusters that support the whole footprint, so we don't have separate bare metal and virtualization clusters. Users come in and can get any compute resource they desire via flavors: if they need a VM, they pick the VM flavor; if they need bare metal, they pick bare metal. That would be a nice place to be in the future. If we could manage that, I think the team would be very happy. I guess that leads to another question I have: would the new guidance that has been proposed by the OpenStack Technical Committee, with a way to release every year instead of every six months, help you with that pain point? Or is it still the same pain, just less often? How does that impact your deployment? I'm not sure whether that would help us or not, because, given the way that we upgrade, and given that OpenStack has done a pretty good job, I think, of getting fast-forward upgrades working, whether we upgrade one or two letters at a time isn't really, I feel, super relevant, especially given that in some of our clouds we're basically just running git master from some point in time. So when the release actually happens is a little bit less relevant. That said, doing that does take some sort of vigilance, because you have to know when things are in a sane state.
Hopefully people don't merge stuff that just breaks everything mid-cycle with the anticipation that they'll clean it up before release. But I think it might at least help with my developer hat on. As a developer, I do kind of like the idea of going to those longer cycles, just because the time between spec freeze, code freeze, all of that, is sometimes frustrating for me. It's like, oh, are we back in code freeze today? Seems like just yesterday we were frozen too. But with the operator hat on, I'm not sure it really makes much of a difference for our model. Okay, and finally we have one last question from Andrea Francescini, more of a clarification on the previous question. You mentioned regions with 100,000 nodes. For a single cluster, is that the same? Is it one cluster for the region, or what size, in number of nodes, do you get for a single cluster? Some of our single clusters, so the clouds, would definitely have multiple tens of thousands of compute nodes, whether that's bare metal or hypervisors: thousands of hypervisors on a single cluster, and much more bare metal. The bare metal footprint we have is rather large, and in the company it's still something we want to shift away from. That goes back to the pets; it's hard to get people to give up their pets. Yeah, that's definitely on the high side of the usual range that we have observed so far at the Large Scale SIG. And also a question around scheduling in that case, again from Andrea: how does VM scheduling work in that case? Do you schedule across multiple clusters, or do you schedule within a cluster? The scheduling is all within a cluster. Again, everything is very separate, and we do have some AZs. Also, I think there was a question earlier about the ML2 driver, and I still didn't get an exact answer for that, but most of our networking is a little bit more static here; we use static routing heavily. And so we have a few custom filters for the scheduling that will schedule based on the network that's been chosen and the Nova AZ, and it'll put it in the right place. It's not to say that's really unusual, right? It's pretty much the standard scheduling mechanism, just with a couple of extra little filters in there to make sure things end up in the right place. Yeah, but if you say you have a custom scheduler filter, it means your scheduler is able to figure out where to place the virtual machine more easily than a pure upstream scheduler. I don't think really that much, no. Like I said, there are one or two extra filters, but mostly that's around understanding the static routes. So if we schedule to a specific network, maybe only a subset of hypervisors has access to it, and that's just a very basic filter, honestly. Otherwise it's all standard. Is it something you introduced in placement itself, or is it in the scheduler? So if you recall, a lot of our clouds are running Ocata, which was before placement was split out. Yeah. So that would be in the Nova scheduler, I guess. But that's easy enough to move into placement; I think we've done that in the case of our newer clouds. Yeah, actually we use a very similar filter at CERN, and now it's actually upstream. It's... I don't remember. Routed networks. Yeah, routed networks.
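To illustrate the kind of small custom filter Adam describes, here is a minimal sketch using Nova's real scheduler filter API (BaseHostFilter and host_passes()). The lookup helper and the use of the requested_networks field (a RequestSpec field in recent Nova) are illustrative assumptions, not Yahoo's actual code.

```python
from nova.scheduler import filters


def lookup_reachable_networks(hostname):
    """Hypothetical lookup: which networks are statically routed to this host.

    In a real deployment this would query the site's routing source of truth.
    """
    raise NotImplementedError("site-specific")


class StaticRouteNetworkFilter(filters.BaseHostFilter):
    """Only pass hosts with a static route to every requested network."""

    def host_passes(self, host_state, spec_obj):
        requested = spec_obj.requested_networks  # networks asked for at boot
        if not requested:
            return True
        reachable = lookup_reachable_networks(host_state.host)
        return all(net.network_id in reachable for net in requested)
```

A filter like this would be enabled via `enabled_filters` in the `[filter_scheduler]` section of nova.conf, alongside the standard filters.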
So yeah, we've definitely tried to have some sort of hand in getting that routed networks support to where we want it to be, a light hand, but we have definitely been hoping it would get to somewhere where we can just use it, right? Because that would be the ideal: all upstream, no downstream is the goal. If only because it's way less work for us, right? As operators, right? I think it's... sorry, go ahead, Belmiro. I was about to change topic, but go ahead if you have something. I was going to say, just about that last note: I think, like everybody, when you start early in your development and operational career, you want to build all these tools and all these nice, creative things. And then the more you do it, the more you're like, I actually want to write the least amount of code possible and maintain the least amount of things. As you grow in your career, you realize that's probably not the path you want to go down. Yeah. So the topic that I would like to discuss now is bare metal. You already mentioned that you use bare metal, and I know that you are a user of Ironic for the bare metal. Can you tell us a little bit more about it? Why use it? What changes did you need to make to Ironic to support that scale? If you can talk to the why-we-use-it part, maybe. Yeah, and you can talk to the changes. Yeah. So, I suppose bare metal versus VM, you know, is definitely an interesting topic. Our general principle is that everybody who possibly can should be using virtual machines, or even better, a higher-level service on top of those virtual machines. But, you know, in reality within the business, it's not always possible. Not every workload is able to run on VMs or is even appropriate for VMs, and for those situations we want to have the bare metal available. As I said, our eventual plan would be to try to have these combined, so we could support bare metal and virtualization from one cloud, but there's definitely a shift needed. Yahoo has been around for a long time, supporting large infrastructure; there's a lot of history there. And I think everyone knows that initially everything was built and configured by Yahoo. Trying to pry that out of people's hands sometimes and get them to move to more of a cloud native way is not the easiest. So, yeah, we continue pushing; we're trying to lead people in the right direction and move away from bare metal. But I think there will always be use cases within the company where we have to support it. And we use bare metal ourselves; our clouds are built on bare metal as well. Do you need different clouds to offer VMs and bare metal, or can you offer both resources within one cloud? No, at this point, we have different clouds. It would be nice to have one cloud for both. You said you're deploying everything on bare metal; are you using OpenStack to deploy your OpenStack? Yeah. Okay. And what about the first one, the first OpenStack you deployed? Yeah, the first OpenStack, obviously, is chicken and egg. We have to have some way to deploy it. But at the end of the day, you could spin up a standalone OpenStack cluster to deploy the first OpenStack cluster, and then use that to build everything else. Well, actually, we are doing that at CERN. Our bare metal compute nodes are provisioned by Ironic initially.
And then the hypervisors that host the VMs, and then the containers on top, actually host our control plane. So it's really inception, definitely. I still think what you guys do at CERN is terrifying. The idea of trying to bring those VMs back up if the control plane is down... I'm sure you guys have practiced it; every time I bring it up, it's like, yeah, we've done it, it's not that hard. But just the concept of "my OpenStack runs my control plane" is scary for me. Our bare metal control planes are rebuilt by themselves. If I have to OS-upgrade our bare metal, I just use it to upgrade itself. We take individual nodes out for OS upgrades and put them back in; same with hardware failures. We use Ironic for all of that. For the bootstrap, yeah. Yeah. So, about Ironic, you're on quite an early version of Ironic? Yeah, Ocata as well, I believe, for those ones. So I've only recently started looking at our Ironic stuff as part of this upgrade project; I was actually primarily on the VM and Octavia side of things. But I can say, yeah, it works, though there were definitely a lot of tweaks to it. I think we had something on the order of 60 downstream patches applied to our Ocata version of Ironic. Everything from, I think, a custom quotas implementation to handle the different network zones and flavors of stuff that people were allowed to use, to, I think, a custom... This is where my lack of Ironic knowledge shows; I've got these terms floating around in my head like Anaconda and IPA. But I know we use a custom boot image system which is now upstream, and as part of that project, we've been able to remove all of those patches. And then a number of patches around speed. I know there were a few issues upstream, everything from just listing the Ironic nodes when you have 10,000 of them, which is rather slow because of the various information it would fetch in the standard upstream version, to server deletes: when you delete bare metal, the order in which it would do things. We had to shift that a little so that it wouldn't take 10 minutes to delete something. Yeah, there have been a lot of performance improvements, actually, in the latest versions. Yeah, and I will definitely be interested to hear more when you guys upgrade to a more recent version, because that would be a great talk. Yeah, I think we've cut the patches by more than half. We're down to at least a somewhat manageable number for what we have now, and hopefully in the future we can cut that down even further. All right, so we have five minutes till the end of the show; I think we need to start to wrap up. How do you see your infrastructure evolving next? Of course you want to do upgrades, but is there something else in the pipeline, like offering containers and so on? I mean, we do have a very large Kubernetes team internally; I actually don't remember whether they're currently running on our bare metal or not. I know they were looking at it for some period of time, but I think containerization is probably not inside the purview of the OpenStack team here. I think we're happy providing just a large amount of bare metal and VM services to users. The upgrade is definitely the biggest thing on our roadmap right now. And I know we have a few other features we're looking at, mostly around just improving the speed of things for users.
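On the slow node listing Adam mentioned: in current clients, one standard mitigation is to ask the Ironic API for only the columns you need instead of every attribute of every node. A small example with the stock CLI (real flags; the field selection is just an illustration):

```bash
# List all bare metal nodes, fetching only a handful of fields per node.
openstack baremetal node list --fields uuid name provision_state power_state --limit 0
```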
I think our newest cloud is the only one that provides nested virt, which was one of the big asks and is going to help get some of the people off of bare metal who previously couldn't move. So getting that rolled out more broadly might be one of our things. Thank you. So I think we have the last set of questions from the audience, Thierry. Yeah, we have one last question before we close, around the services that you have enabled in your deployment. You mentioned obviously Nova, Neutron, and I heard Manila, Ironic. What else? Yeah, we use Designate, and we use Octavia for load balancing. We used to run Ceilometer, but we've since shut that down. Actually, we use Senlin for clustering. There are other ones... I think that's it. So, about Octavia: are you using bare metal for the amphorae? No, we're using VMs for them. Okay. All right, who wants to go next? Otherwise we wrap up the show. If I may ask a few questions about the topic of, not upgrading, but pushing new changes to your infrastructure. We already talked a little bit about that, but you said everything is automated. What if you need to hold back a change, or if you have an issue on your infrastructure, how do you manage that? Is there any automated way to, I don't know, heal the infrastructure, or is it a manual intervention you do? Yeah, it can vary; it depends what the issue is. As I said, we have an automated CI/CD process built on Screwdriver. We also have a lot of automated testing, and we have monitoring built in. We have a Yahoo-internal custom monitoring system that we use to detect any problems, and we also have our own jobs that continually build virtual machines or bare metal and just validate that the cluster and everything is functioning. So we get advance warning if anything goes wrong. And usually when we're doing deployments, because they're bare metal systems, we try to do one node at a time and do a rolling upgrade with validation built into the process, so hopefully we catch things early. Obviously, as I think everyone's aware, deployments are never 100% clean. So when we do have issues, because we're using Chef and with the way things are implemented, rollbacks are not easy; usually it's patch and roll forward. And with the CI/CD process, sometimes the testing lead time can be difficult, so we do have ways to intervene and patch. Obviously we can't do it all manually: we're pushing out across the different clusters across the world, and we don't have enough people to do it by hand. So we leverage automation, and some other automation to help us do that. But it's usually hands-on troubleshooting, identifying the issues, stabilise as quickly as possible, and then we make a decision: do we follow up with a new deployment, or do we do a manual push to recover? I definitely believe moving to containerisation is going to help there a lot. Sorry. I believe there was a question about whether multiple patches ever need to be synced together. And yes, we can pause manually at various stages of the deployment. So we can say, right before this is actually going to go out to production, pause and don't roll forward. But a lot of that is also just taken care of automatically, since, as Brandon was saying, things go through multiple stages.
We have QE and then staging environments that things roll through, and if changes actually rely on each other, it will deploy one, it will break in staging, tests will fail, and the deploy won't move on; then the second one will be deployed, it'll start passing, and the pipeline is unlocked. So for the most part we don't really have to do that manually, though it is an option to go in and manually stop. And just a quick one: how often do you push new changes to your infra? I mean, it's not really a set cadence; we don't have release trains set up or anything. It's pretty much: when a commit gets merged, should that change get pushed out? It works its way through the pipeline. In practice, I think we've seen anything from multiple deploys per day to one a week. And we do run some critical services, so obviously around holidays or special events we don't want to go down, so we do sometimes have change moratoriums. In that case, maybe it'll be two weeks before a change goes out, but that's generally the longest I've ever seen. Okay, thanks. All right, so let's try to wrap up the show. And this is a question we ask all our guests on this large scale show: with all your experience operating this large cloud, I'm sure you have funny stories, like near misses. Can you share something with us? Trying to think about this. I mean... Well, you didn't delete the clouds. Right, yeah. I mean, I'm not going to say we've never had any issues. There have definitely been some, but let's just say I've never had anything nearly as bad here as I have at some of my previous orgs. I could share more stories from previous orgs that are a little funnier, but honestly, I've been really happy with how mature the automation is here, such that most of the time things go out and they just work. I guess the worst I can think of was, given the scale we operate at, a change that appears to be functionally correct and works in a QE or staging environment that has 10 or 20 nodes might behave a little bit differently when it gets pushed out into an environment that has 10,000. So I think the worst that I've ever seen was when I switched our front end to using uWSGI: some of the settings I had for that, which were perfectly fine in staging, did not appropriately serve a cluster of 10,000 nodes in a production environment. And we had our Keystone API basically just stop functioning for a few hours before we were able to resolve it, I think. That was a case of really all hands on deck, figuring out what was wrong and pushing forward. It fortunately turned out to be, for the most part, some settings around threading. Yeah, when you discover the issue, it's always something like that. Yeah, I'm a huge proponent of move forward, not back, so fortunately I've been able to pretty much always make that work here. I do remember one issue we had, I'm not sure if you were on the team at the time, Adam. It was a while back, and it was very difficult to troubleshoot and identify. It was happening on the hypervisors: the networking just suddenly went. The hypervisor was active, everything was running, and then all of a sudden we were losing hypervisors across different clusters. And everybody jumped in, troubleshooting, taking dumps of the network traffic, looking at firewall rules, looking at everything, trying to understand why these systems suddenly... we couldn't even SSH into the hypervisors at all; they were just gone.
You know, and we hadn't done any changes recently, and it seemed to be happening quite regularly but at different times across the clusters. And yeah. Well, my guess now is DNS; like, when is it not? Yeah, that's usually the one. And you know, that was something where we were all like, ah, it could be DNS. But no, this time it actually wasn't DNS. We found out it was CrowdStrike. The security team had managed to push in additional monitoring and logging within the CrowdStrike software, and it actually brought down the whole network. It got to a certain point and it just blew the network off of the hypervisors. That was an interesting one that took a long time to troubleshoot, shall we say, and there were a lot of people scratching their heads saying, I don't understand this. So thank you so much, Brandon, Adam. Thank you so much for sharing all of this with us. It was a great show, and it was very interesting to understand a little bit more about your infrastructure. Yeah, thank you for having us. Yeah, thanks everyone. It was really a great discussion. I appreciate how much you were able to share during the show. I know it's always difficult to talk about internal deployments, but I feel like you were pretty spot on and able to share a lot of details, and I appreciate that. So as a quick reminder, the Open Infrastructure community will meet in person in June in Berlin for the OpenInfra Summit. So I wanted to ask you: are any of you going to be there? I know Mohammed will be, but I don't know about anyone else. Will you be able to join? Yeah, I'll be there. I will be there as well. Well... I would love to be there. I'm still working out whether or not that's going to be possible. Yahoo, I think, is still officially on COVID travel lockdown, so we may not have official support to go, but if I can get over there... I really do miss coming to the Summit and seeing people. I think one of the biggest great things about this community is getting to all come together like that. If I can make it, I'll definitely try, but it's not looking super promising at the moment. Yeah, and I wanted also to mention that, as part of the Summit, we run a series of open community discussions called the Forum, and if you want to discuss a specific topic with the rest of the community, it is still possible to submit topics until Wednesday next week at cfp.openstack.org. We also have an Ops Meetup on the Friday of the Summit week in Berlin. It's an event that is dedicated to open discussions between operators, very similar to what we ran here today. You should follow the Ops Meetup handle on Twitter for upcoming news about this event if you will be in Berlin in June. So that's all I had for today. Let me thank again all of our speakers for today: Adam, Brandon, Arnaud, Belmiro and Mohammed, for this great discussion. And see you all very soon for another episode of OpenInfra Live.