Hello everyone and thanks for joining us. Welcome to OpenInfra Live, the Open Infrastructure Foundation's weekly live show sharing project updates, case studies, open source demos, industry conversations and the latest news from the global open infrastructure community. This is episode 14 already; we have seen some great content, and there is more great content coming up, so I hope you can join us every Thursday at 14:00 UTC, streaming on YouTube, Facebook and LinkedIn. As I mentioned, this is a live show, so we'll be saving some time at the end of the episode for Q&A. Feel free to drop questions into the comment section during the show and we will answer as many as we can. Today's episode is part of a series on large scale OpenStack infrastructure, promoted by the OpenStack Large Scale SIG. We invite operators of large scale deployments to present how they solve a given operational challenge, and to discuss their different approaches among themselves. Today's topic is spare capacity. One of the reasons workloads moved to virtualization and clouds was to avoid having underutilized resources. But as demand for resources goes up and down, the cloud itself can now have a lot of spare capacity. How do OpenStack-based large scale clouds manage their spare capacity? Our guests today are Brendan Conlon, DevOps Manager at Verizon Media; Chris Bermudez, Engineering Manager at InMotion Hosting; Eric Johansson, Senior Systems Engineer at City Network; Victor Molnar, Cloud Architect for the Open Telekom Cloud; and last but not least, Belmiro Moreira, Cloud Architect at CERN, who will host this discussion. So passing it on to you, Belmiro, take it away. Thank you, Thierry. Hello, everyone. I'm Belmiro Moreira. I work at CERN, the European Organization for Nuclear Research, and today I will drive this discussion about how different cloud providers manage spare capacity.
I'm really happy to be on this great panel, which represents different kinds of clouds: public, private and scientific. To kick off the discussion, I will show a very short presentation to give you some context. Here we go. Moving workloads to the cloud has several benefits for users. We won't go through all of them, of course; instead we will focus on scalability and the pay-for-what-you-use model. One of the promises of cloud computing is that users can start small and, when they are ready, massively scale their workloads. This is the illusion of unlimited available resources. Also, they don't need to care about the underlying infrastructure: they no longer need to manage hardware, go through long and expensive hardware purchases and, in the end, buy more capacity than they really need, just in case. Moving on to the next slide: as operators of large-scale infrastructures, we all know that providing the illusion of unlimited resources to users is a huge challenge. The infrastructure needs to be ready for the ups and downs in demand from our users. So the over-provisioning problem that used to sit on the user side has been transferred to the cloud providers. This can leave cloud providers with spare capacity, meaning that resources are not efficiently used. I'm sure that different clouds have different strategies to handle this spare capacity problem: maybe quotas, spot instances, reservation strategies, different hardware purchase models. This is what we will try to discover today: how these different OpenStack clouds are solving this problem. So let's start the discussion. For everyone following the live stream, please don't forget to leave your questions in the comment section; we'll do our best to answer them. Hello, everyone. I think the first question that needs to be answered is: who are you? I know some of you already.
So let's start with Chris. Can you tell us a little bit more about you and the InMotion Hosting cloud? Yeah, so I'm Chris Bermudez. I'm the engineering manager for InMotion Hosting, and also one of the technical leads for our FlexMetal Cloud product, which is an on-demand private cloud that we aim to deliver in under an hour. That's the gist of it. I've been in the field for a little while, but I'm a little new to the whole engineering world and to OpenStack as a whole. Great, thank you. Eric, do you want to go next? Yeah. I'm Eric, a senior systems engineer at City Network, a European hybrid cloud provider. My current focus is more on the automation and tooling side, not necessarily purely on OpenStack, which also involves gathering metrics, alerting and visualization, in this case for capacity management. Great. Brendan, want to go next? Hi, yeah. I'm Brendan Conlon, the DevOps Manager for Verizon Media. We have our own private cloud system that's used across the company, mainly supporting all the different web services and products that we have, so it's a fairly large deployment. I think that's about it. Thank you. And we have Victor. Yes, thank you. My name is Victor Molnar. Currently I'm working as a cloud architect on the Open Telekom Cloud, and as product owner of the hardware side I'm also responsible for capacity management in our environment. The Open Telekom Cloud is mainly a public cloud, basically for big companies, but besides that we also offer hybrid solutions, so we're not just a public cloud provider. Regarding scale, I would say it's quite big, at least in Europe: currently we are talking about 700,000 vCPUs or something like that, so from my point of view it's quite a big environment. Thank you, Victor. And then there's me, with the CERN cloud.
So the CERN private cloud has around 6,000 compute nodes, also manages 8,000 bare metal nodes through Ironic, and runs around 30,000 virtual machines. The cloud supports the organization's different activities: administration and various IT services. But essentially, more than 80% of the cloud capacity goes to processing the scientific data from the different experiments in the organization. All right. So let's address the elephant in the room: do you have spare capacity in your cloud? Who wants to start? If you don't mind, I can start with this one, because for us as a public cloud provider it's really important to have spare capacity. As you stated at the beginning of your presentation, what we promise to customers is basically endless resources; we promise they can consume anything at any time, whenever they need it. That's why this is a really crucial question for us, and it's really important to have spare capacity at all times. But we also need to care about how much we keep spare, because naturally this is also a high-cost business, and while keeping capacity spare we want to reduce our cost. From my point of view that's really important: it's not just about wasting resources, because if you waste resources you also waste operational cost on top. So what we consider most important is to be able to serve any customer request whenever it comes. That's why we defined, let's say, our value chain: how much time we need, in case of a customer request, to deliver new capacity to our cloud. Basically, how much time we need to order hardware, deliver it to the data center, rack it, cable it, install it, and so on.
And based on that, once we establish that lead time as a baseline, then based on previous trends we can estimate how much capacity will be needed over the same window of time that a big delivery takes. But to make it even harder: if we are talking about OpenStack, then OpenStack has lots of different parts. Not just flavors, which Belmiro also mentioned; we need to consider Ironic and bare metal servers, we need to consider dedicated hosts, and lots of other parts. OpenStack also has lots of upper-layer services that rely on this IaaS layer. So what we are checking almost all the time is what we call fragmentation, because you are never able to allocate 100% of a physical device, except when you use a 1:1 ratio, or bare metal, or dedicated hosts, or something like that. We calculate the loss due to fragmentation, and then we keep as much spare capacity as is necessary to survive until we are able to do a big expansion, if one is needed. Because as a business, the most important thing is to serve the customers; that's where the money comes from. So I think that is the core optimization problem. So what you are saying is that for public cloud providers it's basically inevitable to have spare capacity? Yes, but it's also different for different cloud providers, because you need to consider whether, for example, a public cloud provider builds its own hardware; that's a different story. If you need to buy it, that's also a different story. For us, because we are in partnership with Huawei, hardware delivery is, I would say, a little easier.
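As a side note for readers, the lead-time calculation Victor describes, sizing the spare pool so demand growth can be absorbed until a hardware delivery lands, can be sketched roughly like this (the function and all numbers are hypothetical illustrations, not figures from the show):

```python
# Hypothetical sketch of a lead-time based spare-capacity estimate:
# how many vCPUs must be held spare so that demand growth can be
# absorbed while new hardware is ordered, shipped, racked and cabled.

def spare_capacity_target(weekly_growth_vcpus, lead_time_weeks, safety_factor=1.2):
    """vCPUs to keep spare to survive one full hardware-delivery cycle."""
    return int(weekly_growth_vcpus * lead_time_weeks * safety_factor)

# Example: demand grows ~5,000 vCPUs/week and a big delivery takes 8 weeks.
print(spare_capacity_target(5000, 8))  # 48000
```

In practice the growth figure would come from the historical trends the panel discusses later, and the safety factor would cover forecast error plus the fragmentation loss Victor mentions.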
So in some cases we are also able not just to keep spare capacity inside our server rooms, but also to keep some spare servers in the warehouse. And the other thing I mentioned, which from my point of view is really important, and is where you can reduce cost and be more efficient even when you have spare capacity: you need to consider how to turn things off when they are not used. For bare metal servers under Ironic, for example, a node only powers on when somebody wants to use it. If you can do something similar for the normal flavors, so that when you have lots of spare capacity a big chunk of it sits powered off, then you are also able to reduce your cost. Yeah, let's try to touch on all these points later on; I'd like to give the others the opportunity to share their view too. Eric, you also run public clouds? Yeah, I definitely agree with Victor on that. We are a hybrid cloud provider, so we have seven public regions globally and also a number of high security clouds, which are private or multi-tenant. But public cloud is a very, very different story compared to most of the private regions we have, and as such we have more spare capacity in the public regions because, just like Victor said, we need it. I usually say it's a little more wild west on that side. We try to solve it in some cases, like Victor mentioned, by having spare hardware in the different data centers. We also have some cases where we run that hardware in pre-production environments, which makes it easier to either ship it, or, if it's in the same DC, move it into the production environment when we see the need. And we have some other use cases as well where we can reschedule certain workloads.
For example, we run a lot of test and sandbox systems internally, which we can reschedule to different regions based on the metrics we're pulling in on the spare capacity at the moment. We have an education branch that runs many, many things; among them our self-paced courses, and those runs can shift a little between different public clouds or different public regions, depending on how the capacity looks at the moment. Right. Chris, do you have something to add on the public cloud side? Yeah, for us it's a little different, since we are a hosting provider. We have multiple product lines, so our main goal right now is building a unified hardware platform across our products, so we can place hardware wherever the need is. So yes, we do have spare capacity, but we don't think of it in the same way as pure cloud capacity. Whether it's Ram Node, one of our sister companies, scaling out their public cloud, or FlexMetal within our private cloud, or the storage clusters that we provide, we look at what the demand is for each of those and allocate hardware accordingly. All right. And Brendan, on the big private cloud side? Yeah, I think it's similar, to be honest, maybe not at quite the same scale, but it's really the same. We as a team make the same promises to our customers, who just happen to be in the same company. And even though we call it a cloud, they know exactly how long it takes you to get hardware in place and get things set up. So it definitely makes it interesting; it's a problem that needs a lot of effort. And there's always the tension: utilization versus having enough spare capacity available at any time for anybody to use. So it's definitely something we have to balance.
We have, I think, almost 40 different clusters, so probably about 40 different control planes, running different instances on different networks for varying purposes. So the fragmentation is sometimes very difficult to manage. Moving hosts around, as was mentioned, still takes time and effort; it's not free to move things even within a data center. So it can be very tricky to stay on top of that capacity; balancing utilization while never letting your user base have any issues due to capacity is the key. I can also give the CERN example. If you ask me directly whether we have spare capacity, I could immediately say no. But then of course you'd ask why. As I told you, more than 80% of our capacity goes to processing data from the different experiments, and we always strive for more capacity to run all that processing. So less than 20% of the cloud is dedicated to services: for the users of the organization to interact with the APIs and create their services, and for the IT department to build the IT services for the organization. Of course, there we need to keep capacity available for users to create things in a dynamic way. But we use some techniques to reduce this spare capacity as much as possible, and I think we'll talk later about what we do to overcome it. All right. So how can we assess a good balance between spare capacity and demand fluctuation? When everyone starts their cloud they don't know the demand; it's a bit different for private clouds, but especially in the public cloud world they will not know it. And then there is a big shopping season, for example, and everyone starts building websites and requires capacity. So how do you manage this demand with the spare capacity that you have? Victor, if you want to go. Sure, I can.
So basically, regarding this question, the most important thing to consider is the size of the cloud. Yes, as you mentioned, we are talking mainly about public clouds here. But as your cloud becomes bigger and bigger, these issues become smaller and smaller. Because if you have one million cores, it doesn't matter if, even in the Christmas period, some projects start to consume 10,000 or 50,000 more; you don't need to care about that. But if those one million cores are all being used, that will be an issue, right? Yeah, because your clients will be bigger as well. You're totally right. But if we start from the point we just discussed, that as a public cloud provider we keep spare capacity, then, based on what I mentioned, our aim is to size the spare capacity so that even if we receive a really big demand from some customer, we will be able to serve it, and even if they consume all of the spare capacity, we will still have enough time to build up more. Basically, that is our target, and for this we naturally need to focus a lot on automation. We are moving in the direction of being able to add new resources to the cloud faster and faster, because if we can add resources faster, it becomes easier to handle demand and we don't need to hold as much spare capacity, so it becomes more efficient. The other part is a little more technical. If we are talking about spare capacity, and if I'm not mistaken Brendan and Eric also mentioned this fragmentation, there is something you can do if you make the Nova scheduling more optimal.
Because if the scheduling is more optimal, and if you have the possibility of live migration, then you are able to reduce the fragmentation. Then you also need less spare capacity and waste fewer resources. That's how I see this question, but since the same things were mentioned by Eric and Brendan, I'm very curious about their opinion on it. I think you touched on an important point when you said that adding hardware to the cloud dynamically and very fast can help; that is one way to mitigate the problem. However, in these times of so many hardware component shortages, I see this as an even bigger challenge, because purchase orders can be delayed by component shortages. You're totally right, and what you just mentioned applies to GPUs. With these high-performing NVIDIA GPUs, in most cases, even as a cloud provider, you sometimes need to wait months. That's why for these components capacity management is even more crucial than for the others: it's harder and takes longer to serve a big need when it arrives. And these GPU nodes are much more expensive than normal compute servers, so you really don't want hundreds or thousands of servers sitting unused; that simply doesn't work. So, we have a question from Prakash, who asks: how do you measure your capacity? Eric, do you want to answer that? For pure measurement, technically we're using the Prometheus stack with a bunch of exporters. But for us, the whole question of spare capacity and how to assess whether we have a good balance comes down to two things. There is the technical side, where we build historical trends and from experience determine whether we should scale up or down and so on; and there is customer communication.
I think Victor mentioned customer requests before: if we see one of our larger customers, or a new one, and we predict a burst increase of resources in one of our regions, we try to have good communication with those customers and make sure they let us know beforehand, because historical trends will only take you so far. Burst increases, and decreases as well, are probably where we struggle the most, in the sense that, as you mentioned, they vary a lot on short time scales. But to measure, like I said, we use the Prometheus stack; it's just one of many technical tools for this, primarily with the openstack-exporter and some others. Right, Victor? Just one question, maybe for Eric, regarding these measurements. There are lots of different things you can measure, and in most cases most companies start by measuring how many CPU cores we have, how much memory, how many GPUs, how much disk space and so on. But to avoid the fragmentation issue we just discussed, we also started to measure, at any given moment, how many instances of each different flavor type we can still start. That way we can see our free capacity in terms of how many new ECSs and services we can create, and then we don't need to account separately for the hardware-level fragmentation we discussed earlier. Do you also do it like this in your environment, or do you just measure the hardware side?
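For readers, the flavor-based measurement Victor describes, counting how many instances of each flavor still fit rather than summing raw free resources, might be sketched like this (the host and flavor numbers are made up for illustration):

```python
# Hypothetical sketch of a fragmentation-aware capacity measurement:
# instead of summing raw free vCPUs/RAM across the cloud, count how many
# instances of a given flavor actually fit host by host. The difference
# between the two views is the fragmentation loss.

def instances_that_fit(hosts, flavor):
    """hosts: list of (free_vcpus, free_ram_gb) per hypervisor.
    flavor: (vcpus, ram_gb). Returns how many instances can still start."""
    vcpus, ram = flavor
    return sum(min(fv // vcpus, fr // ram) for fv, fr in hosts)

hosts = [(14, 48), (6, 100), (30, 20)]  # free capacity per hypervisor
print(instances_that_fit(hosts, (8, 32)))  # 1
# Raw totals (50 free vCPUs, 168 GB free RAM) would suggest 5 such
# instances, but per-host packing allows only one.
```

A real measurement would also account for allocation ratios and reserved host resources, but the per-host packing idea is the key difference from the hardware-level view.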
No, I mean, we measure similarly to what you mentioned. We try to measure not only for the region itself but also for individual compute nodes: how many resources are left on a given compute node, for example, or on a network node running L3 agents, or whatever it could be. It's really important to remember that capacity management and spare capacity are not one-dimensional. You mentioned GPUs; there is the network side as well; there is even the control plane, when you start asking how many agents we can actually host in a region, depending on how the control plane is set up. So, like I said, it's multi-dimensional, but yes, we have similar measurements, not just for the region but also for individual resources of different kinds. And Chris, I'm curious to hear from you. Yeah, so for us it's a little different. We largely depend on our private cloud customers to make those determinations for us, but we provide a few different building blocks for them to choose from to scale their capacity accordingly. So it does get pushed to them, but on our side we do struggle a little with figuring out what mix to always have in stock, because it isn't logical to just have hardware sitting there. A lot of our advantage comes from our full bare metal management system, so we're able to keep, quote-unquote, warm spares that are just powered off; at least we're cutting some cost there. But it's still a challenge for us, and as we grow it's a huge learning experience that we'll eventually fully figure out. Right now it's mostly looking at a rack, seeing what is used in one of our FlexMetal pods, and determining when to order things. And with the hardware shortages, that's also tough, because now we have to forecast three or four months in advance how our activity is going to look.
That forecasting is interesting. We have a question from the audience, from Jagia, I hope I pronounced that correctly, about capacity control using machine learning, and any challenges with it. Do we use machine learning to predict this? My understanding is that we don't, right? We'd like to at some point, yeah. Well, to use machine learning you also need a lot of data, right? Basically, I think machine learning is all about how much data you have. For us, we really do have lots of data that we could feed into machine learning. But from my point of view, machine learning can help you predict the normal trends, while on the business side you are able to see the big upcoming projects, the huge peaks, which don't show up in the normal growth. So for the normal trends it's really good to have such a solution, but for these exceptional peaks it will be really hard. Yeah. So Brendan, now we are on the private cloud side; can you relate to this? Yeah, I can relate.
Yeah, definitely, from a capacity modelling point of view. I joined the team about two years ago and inherited the operations of a very large system: hundreds of thousands of bare metal nodes, tens of thousands of VMs, multiple clusters. We're actually just building out a whole new cluster, and we're trying to get the company to move off our metal systems, move more into VMs, start using auto scaling, software load balancing, all the new technologies. So right now we have the problem of balancing the old hardware while getting people to move to the new: we've got to scale up the new as we reduce the old, but there's an overlap where nobody wants to move until everything is tested and stable, so they want more capacity before they actually shut down any old capacity. We're in a balancing act just now. And I think we're the same: we collect basically as much data as we can, down to the lowest level, IP usage, volume size usage, NFS space, vCPU, memory, everything, and we send it all to our grid system. We're very lucky in the private cloud space at Verizon Media: we have a team that runs Hadoop and does a lot of big data processing, so we can send all the data to them, run processing against it, and try to generate some modelling that lets us manage the capacity better. Something the guys touched on earlier was lead times. Lead times are changing quite a lot just now; it's hard to get hardware when you want it. And I think that goes back to something else that was touched on: automation. If you can automate ordering your systems, and automate getting them online in the right place as soon as they arrive in the data center, that's definitely somewhere where I think putting focus can really make a difference, because the raw hardware lead times are quite tricky; I think everyone's probably feeling that. Yeah, I relate to almost everything, and I especially relate to what you said about being a private cloud: we control the workloads much more, and we know how to predict the workloads that we're going to receive. As I told you, for the less than 20% of the capacity that we use for services for the users, we have that capacity available and it doesn't grow a lot, because we know that the services will not increase much in the IT department. Basically, when we replace the machines after five years of service, when they reach end of life, we get new servers, and at that point we get much more capacity because the servers are newer and have much more processing power. So we keep basically the same number of servers but end up with more and more capacity, and then we try to optimize the use of the capacity that is left unused by the users. And I think that's a good topic to touch on next. As Victor said, having spare capacity, especially in public clouds, is inevitable. So how can we mitigate this, or, in the case of public clouds, monetize it in a better way, so that these resources are not lost? Do you want to start, Victor? I can, but from my point of view this is an even harder topic than normal capacity management. Naturally, in non-OpenStack-based clouds there are lots of different techniques available, for example spot instances. But it was also a great point we heard in the introduction that you can optimize your internal processes, for example tests, to use this spare capacity; that only works, though, if you have, for example, different availability zones and clusters and you are able to move your workload
between the clusters. But besides that, the other thing you need to consider is that in OpenStack, spot instances are currently not available. So what we do, as I mentioned: if you simply stop your machines, you are wasting those resources in the sense that they sit unused and generate nothing for you, but at least they are not costing you either. Are you saying power off the compute nodes? Yes, exactly, if you are able to. I have seen that in small clouds. No, no, also in some bigger ones. If you need to hold a lot of spare capacity, it's much wiser to power off some of your machines, because you don't need all of that spare capacity instantly. If a node is installed and already part of a cluster, you just put it in maintenance and power it off, and within a few minutes you can power it back on. Naturally, while it's off it doesn't generate money for you and doesn't give you more resources, but you still avoid the losses, so it's better than nothing. In the long term, as I mentioned, if you can figure out other use cases for this spare capacity, like offering it at a lower price as a spot instance, or giving it to internal projects, trainings, or whatever, that's better, because then you are not wasting the resources. But if it doesn't directly generate money, if we're not talking about spot instances but just using it for something else, then yes, from a workload point of view it's better, but it's also an investment from the company into something else. Yeah, I completely agree. Yeah, so it's really hard.
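For readers, the power-off policy Victor outlines, keeping enough live headroom for short-term demand and switching off the emptiest idle hosts, could be sketched like this (a minimal illustration; the host data and function are hypothetical):

```python
# Hypothetical sketch of a "power off the idle spares" policy: keep
# `headroom_vcpus` of free capacity live for short-term demand, and
# power off additional empty hosts to save operating cost (they can be
# powered back on within minutes, as discussed).

def hosts_to_power_off(hosts, headroom_vcpus):
    """hosts: dict name -> (used_vcpus, total_vcpus).
    Returns the empty hosts that can be switched off while the remaining
    live free capacity stays at or above headroom_vcpus."""
    live_free = sum(total - used for used, total in hosts.values())
    off = []
    # consider the smallest empty hosts first, so headroom shrinks gradually
    empty = sorted((n for n, (u, _) in hosts.items() if u == 0),
                   key=lambda n: hosts[n][1])
    for name in empty:
        total = hosts[name][1]
        if live_free - total >= headroom_vcpus:
            off.append(name)
            live_free -= total
    return off

hosts = {"a": (0, 64), "b": (0, 64), "c": (40, 64), "d": (10, 64)}
print(hosts_to_power_off(hosts, headroom_vcpus=100))  # ['a']
```

In a real deployment the selected nodes would first be disabled in the scheduler (so no new instances land on them) before being powered down out-of-band.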
Is there something you want to add, Chris? No, I pretty much agree with that. That's why, for us, we focus very heavily on how we manage bare metal as a whole, and on maintaining our capacity and our inventory as efficiently as possible. Eric? No, I mean, I agree. I mentioned a couple of use cases where we try to solve it as well. Yes, we do power off machines, but we also try to use this capacity for pre-production and staging environments in the same data center; so not necessarily powered off, but used for testing upgrades, life cycle management, new projects and so on. We also try to shift internal workloads to different regions based on the current metrics we're gathering on spare capacity. And what about price differentiation, for example? I know there are no spot instances in OpenStack, but price differentiation is still an option: instances cheaper in January but more expensive in December. Are there techniques like this that you use? Or quotas, for example, or starting to overcommit if you are lacking resources, basically to overcome it? You mentioned lots of different options; the one I'd like to react to is the monthly, time-based pricing. From our point of view, if you are able to define the customer trends and you are at a reasonably big scale, this is not an issue for us: if I check our workload in December, March, April, or basically any month, there is no significant peak, so time-based pricing is not really helpful for us. The other one you mentioned is overcommitment. I think that's a different topic, but regarding spare capacity, from my point of view it doesn't help at all, because if you are a cloud provider, at the same price you need to deliver the same thing. So if I create a flavor,
naturally the performance must be the same every time. If I set up overcommitment on the same flavor, the customers will instantly start to complain; they will not be happy. But if from the beginning you create a flavor that uses overcommitment, which is what we also do, that works: we have some machines dedicated to a flavor with an overcommitment ratio of two or three. It's well known by the customers, and it's cheaper, but they can also choose what we call, for example, dedicated general purpose, which is almost the same but without overcommitment. So from a capacity point of view, at least for us, it's not really helpful; it's just a different flavor.

Right, I understand. Brandon?

Yeah, I think one of the other things I'm beginning to see more now as well, and it really depends on what tooling and systems people use for deployments, is that, for example, we have some teams that want to use blue-green deployments with Terraform.

I think we are losing Brandon. Oh no, you're back again.

Yeah, I'm back. Okay. I don't know where I was. I was talking about Terraform and doing automated deployments with blue-green mechanisms. Basically, some teams need double capacity, and it starts to get tricky to manage. We try to use quota allocations to manage it, but then you have to manage unbooted quota versus booted quota. If you're doing a chargeback model, obviously you're going to charge more for people that are actually using instances, but you really want to keep that other capacity available. And if it gets to the point where everyone is doing that, you really don't want to keep double capacity; it doesn't make sense at that point. So I think it's definitely a balancing act in how you manage that, and it's something that we're definitely looking at going forward, to see what we can do there and what the best way to manage it is.
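The dedicated versus overcommitted flavor pools Victor describes come down to Nova's allocation ratios: the scheduler sees roughly `physical cores × cpu_allocation_ratio` schedulable vCPUs per host. A minimal sketch of the capacity arithmetic, where the host and flavor sizes are made-up examples:

```python
def instances_per_host(physical_cores, flavor_vcpus, cpu_allocation_ratio=1.0):
    """How many instances of a flavor fit on one host.
    Nova advertises physical_cores * cpu_allocation_ratio vCPUs to the
    scheduler, so a higher ratio packs more (slower) instances per host."""
    return int(physical_cores * cpu_allocation_ratio) // flavor_vcpus

# A 64-core host with a 4-vCPU flavor:
dedicated = instances_per_host(64, 4)         # ratio 1.0 -> 16 instances
overcommitted = instances_per_host(64, 4, 3)  # ratio 3.0 -> 48 instances
```

In practice the ratio is configured per compute host (`cpu_allocation_ratio` in `nova.conf`, or via Placement), and the cheaper overcommitted flavors are pinned to those hosts with host aggregates, so the two pools never mix, which is what keeps the per-flavor performance promise Victor insists on.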
Yeah, there are definitely a lot of interesting issues to deal with in managing spare capacity. Thank you. I also would like to tell you what we do at CERN, and our use case is completely different from yours. As a scientific organization, we try to squeeze out all the capacity we have available in the data center. For example, imagine when we are decommissioning compute nodes because they are at the end of their life. There is this period when we remove the production workloads from them: we migrate the instances, or we remove all the instances if they are for batch processing. And there is this period when the compute nodes are still up, because the operations team hasn't touched them yet to start removing them in bulk. So what we usually do, to try to squeeze everything from the resources, is use volunteer computing, like BOINC, basically to process data for the scientific community. Two projects that we use: one is LHC@home, which runs different simulations that help the researchers at CERN improve the LHC. Something we also ran a lot last year was Rosetta@home, which is also a BOINC volunteer computing project that tries to help the design of new protein molecules, and in COVID times we thought that would be very useful. So when we have this kind of spare capacity, we try to use it for these scientific projects as well. However, something that I feel would be great for OpenStack clouds to support is spot instances, or preemptible instances, as you may want to call them. Do you feel the same need for the project to support this? Would that be a good addition to OpenStack? What do you think about it?

I would say definitely yes, at least from our side. Basically, we would be happy to see this in OpenStack, and we have already started to think
about how we could do it on OpenStack ourselves, for example, because, as you just mentioned, it's not yet in the OpenStack code base. I think for cloud providers it can really help a lot, also from an efficiency and capacity management point of view. Do others share the same opinion?

Yeah, I mean, it would be very beneficial for us as well, definitely, especially in the public cloud, because price tends to be a bigger thing there. We have more smaller companies, and actually individuals, running in the public arena, and I think the price point alone for such a thing would greatly benefit us as well.

And for spare capacity, like you mentioned... please, Brandon.

Yeah, more so for our internal clusters and some of our public clusters we definitely see a great use for it. For our private cloud offering, maybe to pass it down to the customer it would probably be good, but the way we look at it, it really doesn't apply to us as heavily.

Yeah, I feel the same. From that perspective it's probably not quite as relevant as for the public cloud.

Okay. In our use case, the CERN cloud, we are not a public provider but a private cloud, and we actually feel the need for a feature like this a lot. I can give you an example. If hardware is purchased for a specific project, and that project doesn't need all the purchased capacity from day one, because they will expand over time, are we going to fill up those available resources? Of course. What we do is manually provision batch virtual machines on those resources to use the capacity. The problem is that when the original user, for whom the resources were purchased, starts to create instances, their instances will fail, because the resources are in use for the other use case, for batch. So then there is this manual process to remove batch instances.
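A preemptible-instance "reaper" automates exactly that cleanup: when a regular request cannot be scheduled, it deletes just enough preemptible instances to free the needed room. The greedy, smallest-first policy below is an illustrative assumption, not necessarily what CERN's prototype does, and the instance names and sizes are invented.

```python
def select_preemptibles_to_delete(preemptibles, vcpus_needed):
    """Choose preemptible instances to delete so that at least
    vcpus_needed vCPUs are freed on a host.
    preemptibles: list of (instance_id, vcpus) tuples on that host.
    If the preemptibles cannot cover the request, all are returned."""
    freed, victims = 0, []
    # Greedy policy: evict the smallest instances first, so fewer
    # vCPUs of batch work are lost than strictly necessary.
    for instance_id, vcpus in sorted(preemptibles, key=lambda p: p[1]):
        if freed >= vcpus_needed:
            break
        victims.append(instance_id)
        freed += vcpus
    return victims

# A 4-vCPU request arrives on a full host running three batch VMs:
# the two smallest (2 + 4 vCPUs) are enough; the 8-vCPU one survives.
print(select_preemptibles_to_delete(
    [("big", 8), ("tiny", 2), ("mid", 4)], 4))  # -> ['tiny', 'mid']
```

The hard part, which the panel comes back to, is not this selection logic but wiring it into Nova and Placement so the scheduler triggers it automatically instead of an operator doing the cleanup by hand.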
For the original user to be able to create their instances, we have to do that by hand. So we see this as the typical preemptible instance use case. And actually, we did some work on this around three or four years ago. It was initially a CERN openlab project with Huawei, where we started to design a solution to bring preemptible instances into OpenStack. Later we also worked in collaboration with SKA, the Square Kilometre Array Observatory, initially thinking of the scientific use cases for the scientific community inside OpenStack. We gave several presentations around this topic of preemptible instances in OpenStack, and we developed a prototype that we believe is actually quite good. We are running it in our production cloud, which allows us to have preemptible instances in our cloud. However, this ties in completely with Nova and Placement, and it would be great if the community picked this up and integrated it with OpenStack Nova and Placement. I think this is one of the areas where the operator community could raise this concern and the need for this kind of feature. If you are interested in this, we wrote a blog post where we describe our latest experience. The code is available, so you can have a look, and let us know if you have any comments.

All right, so we are approaching the end, and there are several questions from the audience. Hi, Thierry, I think you want to go through them?

Thanks, Belmiro. Yeah, we have lots of questions; we probably won't have enough time to answer them all, but there are a few that we wanted to bring to the discussion. The first one is from Nils Magnus, and it's around automation, since we touched on machine learning in various ways. His question is: how can automation leverage capacity management? Is an investment in these activities well spent, or are workloads always so different and individual that you can't really plan for them anyway?

Yeah, I think, first of
all but I would like to say that from my point of view basically any spent on automation not just in capacity part is revert so I think really this is the future because uh yeah it will be much more faster much more accurate and much easier for any other companies and basically also from capacity management point of view can help a lot this is what we also uh touched in the um in the discussion in this session because with automation you will be also able to reduce the time what we what is your lead time to be able to provide more capacity so in this way uh you don't need to let's say think for far for for example for months or years or weeks you just need to think for far for the following some days if you are able to uh fully automate your solution and also there is one other part regarding this the real capacity management automation where we not just consider how we can let's say provide more resources but also what can predict based on the trends and what is able to let's say even on the self decide what we will need and what we need to install so from my point of view in the short term phrase yes every any every cent what is spent on automation is revert yeah i also very much agree with that i mean for for our case and especially with with being able to scale the way you need to you know the the faster you can scale with automation just increases those you know yes like there's there's always going to be scenarios where you're going to get hit and you just don't have the capacity there but if you have the automation also in place to to add capacity you know that that's just as important anyone else yeah i agree the automation i think is key because it gives you predictive lead times you know you know you have a you know deeper understand timelines that things are going to take which allows you to then plan correctly and you know possibly reduce your overhead and your additional capacity because you know exactly how long it's going to take you know for that 
automation to go through and for new capacity to land in your cluster. From that point alone, I think it's definitely worth the time put into automation. If you leave it down to individuals, well, everyone's busy, everyone gets pulled into different things, and it's not predictable. Spending the time on the automation side of things definitely makes a big difference.

Yeah, I mean, I definitely agree as well. For sure there are aspects of the whole process that are harder to automate in the sense of lead times; I mean, you have the whole ordering of the hardware, which is what it is. But I completely agree with everyone else that everything that is manual labor should be automated, both from a lead time perspective and from a quality assurance perspective, for example when it comes to using a modern GitOps approach and declaratively moving new hardware between stages in your CMDB, to automatically take it from newly racked to fully in production.

I completely agree as well with the automation. And actually, what we are trying to do now with automation is to leverage Ironic to provision our bare metal resources in the CERN cloud in an automated and integrated way.

Okay, thanks. So the next question is from Salie or Shanghi, and it's actually around storage, because we've talked a lot about compute resources, but obviously elasticity also strikes in storage and networking. It takes the example of a Ceph RBD cluster which could have a capacity of 10,000 IOPS and a thousand gigabytes of storage; however, as the RBD cluster scales to 2,000 gigabytes, the IOPS scale to 20,000. Basic quality of service allows you to define hard limits for volumes. So the question is: which one of those strategies is being used in your case, and why? And
it's also the broader question of whether storage triggers challenges that are different from compute.

Yep. From my point of view, yes, it's a little bit different, because for storage, the things we discussed, fragmentation and so on, matter less; there you don't have as many flavors to consider. So I would say that from a storage point of view it's easier. But most cloud providers also use much higher overcommitment for storage, almost everywhere. The one thing you do need to differentiate is between two parts: the allocation versus the real physical usage of your storage devices. And regarding this IOPS question: from my point of view it's really important to set up hard limits, because that way you are able to prevent some customers from affecting other ones, and I think it's really important to make that happen. And if you would like to provide different IOPS numbers, which I think is really important in the public cloud, but I guess also in private and hybrid ones, then you can build, for example, one storage solution on spinning disks and another on SSDs, and for the different clusters you can set up different hard limits for the IOPS. That way you can serve everything, and based on these hard limits you avoid any customer affecting another.

Yeah, I mean, for storage, which is for sure a little bit different, you have two primary scaling mechanisms: one is for pure storage demand, the other is for keeping up with the overall performance, based on all the resources that are running there. And you also have the IOPS-per-gigabyte functionality.
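Scaling a volume's hard limit with its size, like the 10-IOPS-per-gigabyte ratio in the question's example, is what Cinder QoS specs such as `total_iops_sec_per_gb` express; the hypervisor then enforces the limit per volume. A sketch of that arithmetic, where the floor and cap values are invented for illustration:

```python
def volume_iops_limit(size_gb, iops_per_gb=10, floor=100, cap=20000):
    """Front-end IOPS hard limit that grows with volume size: a
    1000 GB volume gets 10000 IOPS, a 2000 GB volume gets 20000."""
    # The floor keeps tiny volumes usable; the cap protects the backend
    # from a single huge volume claiming the whole cluster's IOPS.
    return max(floor, min(size_gb * iops_per_gb, cap))

print(volume_iops_limit(1000))  # -> 10000
print(volume_iops_limit(2000))  # -> 20000
print(volume_iops_limit(5))     # -> 100 (floor applies)
```

In a real deployment the equivalent QoS spec is attached to a volume type, so every volume of that type inherits the per-gigabyte limit automatically, and different volume types (SATA, SSD, NVMe) can carry different ratios, matching the tiered offerings the panel describes.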
With KVM you can utilize that, so the customer actually pays for both the performance and the amount of storage they are using. And of course there are different storage backends and different kinds of hard drives that you build on, depending on whether you're buying high-performance SSD storage or going all the way down to archive storage, which is really slow, and so on.

Yeah, I mean, similarly, for our internal workloads we do tend to set hard limits, just to make sure that we don't step on each other. But for some of our private offerings, we offer storage plans with different storage drives, a mix between SATA, SSD and NVMe, for whatever options they need to scale with, whatever their storage needs happen to be. So we don't actually make that decision for them, but we give them the proper platform, and we mostly set up some best practices for them as well.

Okay. And Brandon, did you have a comment?

Yeah, I think we're similar as well. We offer the different solutions: SSDs, SATA drives, even NFS file shares via Manila. And we try to implement hard limits, but we try to keep them quite high-level, and then we have monitoring on the system to make sure we don't have different teams stomping on each other, because we want to offer the fastest and best service we can, up to a certain level; as long as there are no issues within the system, then it's okay. So it's a bit of a balancing act between limits that aren't too tight and a monitoring solution that makes sure we don't have any issues across the cluster. We try to balance it that way.

Okay, we're reaching the end, so one last question, from Prakash: do you see a role for collaboration among OpenStack cloud service providers in different locations to enable utility
computing, like power grids for load sharing? Should OpenInfra work with the UN or other countries to raise the ability to help contain costs? I guess it's a different take on spare capacity and handling elasticity: basically, can we just use the interoperability between OpenStack clouds to solve part of the problem? I guess it's more a question for public clouds, but do you see that as one of the dimensions you can use, or is it more complicated than that? Maybe Victor or Johan?

I mean, it's a tricky question to answer, but we're seeing something, at least, with the whole Gaia-X project and others that are ongoing, where interoperability and sharing identity as a consumer across different European cloud providers is coming in as well. But I'm not sure.

Yep, basically I would say the same thing that was just mentioned: it's really not easy to do that big transfer, and not just the transfer, but also to create such a solution in the first place. Every provider works in a different way, so if you would like to do something that can be used, let's say, across all of them, then there would have to be a really huge collaboration, and I think that's really not so easy to achieve. And if you consider, for example, what was mentioned here with the power grid, then yeah, it will be even more complicated. So I would not expect that to come in the near future.

It will be interesting... go ahead.

I don't know. It will be interesting to see if Gaia-X takes off and really allows those providers to interoperate, and if that shifts the landscape of this infinite capacity toward a network of smaller actors rather than a few big actors.

So I think it's time for us to wrap this up. Thanks to all of our awesome guests today; this was really a great discussion, and we've learned a lot from the diverse viewpoints and experiences. Next week we will have another great episode lined up: we will be
discussing Kata Containers, an open source project supported by the Open Infrastructure Foundation that allows infrastructure users to benefit from the security of VMs while keeping the agility and speed of containers. It will be a very interesting one; I recommend it. Also remember that if you have an idea for a future episode, we want to hear from you: submit your ideas at ideas.openinfra.live, and mark your calendars. I hope you will all be able to join us next Thursday at 14 UTC. Thanks again to all of our speakers who joined us today, and see you all on the next Open Infra Live.