Okay. Thank you, everyone, for showing up for this meeting so late in the day. I know it's getting harder and harder to concentrate as the day moves on. So let's talk about how obsolete your cloud is. I get that question time and again when I do assessments for customers. But before we do, if somebody wants to take a picture of this QR code, I'll leave it up for a minute or so.

About myself: I'm Christian Hübner. I am a principal architect at Mirantis for services. Storage and infrastructure are my main topics, which obviously puts this talk right at the core of what I do in my day-to-day work. I've been with Mirantis for a little more than nine years. I've seen everything OpenStack, and I've seen a lot of technological development. It never really stops. A lot of customers have built on one OpenStack platform, then the next OpenStack platform. The question that always comes back is: should we buy new hardware? Should we upgrade what we currently have? Should we build a hybrid somewhere in the middle? So let's have a look at that today.

I invented a non-existent company, because I do not really want to put anyone who actually exists in there. The idea is that we look at an environment created about five or six years ago, in the heyday of the Xeon E5 series, and see which of the hardware is still applicable today, what needs to be upgraded, and how you can decide which solution is best for you. This is what a cloud back then used to look like. You have controllers, typically something relatively low-powered; clouds were not really all that big five or six years ago. You typically had hard disks for pretty much everything, except for the needed journal SSDs, which was always a discussion: can we make them a little bit smaller, because SSDs are so unbelievably expensive?
Back then, of course, we had compute nodes with what was for the longest time the most cost-efficient processor you could buy, the E5-2650 v4, some memory, and again hard disks for boot. And we had a storage subsystem that also used almost entirely hard disks: 20 times two-terabyte hard disks and four times 240-gigabyte SSDs per node. That was big enough back then to run the journals of the Ceph cluster. And of course it's all 10-gigabit Ethernet throughout, because back then faster Ethernet was really very expensive.

So what do we have in workloads? Normally you have a wide variety of workloads. So I add up what's there and divide by the number of workloads, and I get an average workload, so we have a basis to calculate on. Let's say we have four cores, four gigabytes of memory, and 100 gigabytes of storage. This would probably be a mix of a lot of Linux workloads and a few Windows workloads thrown in that need a lot more storage. And let's say in our case we also have what was fairly typical back then: no failure domains defined. Basically everything is just one soup of compute nodes and storage nodes. And of course, after these five or six years, your cloud has filled up to almost capacity, from both a performance standpoint and a capacity standpoint. So you have to look at what you can do to drag this cloud kicking and screaming into 2022.

First, I hear time and again: it still runs. Yes, it does. It's slow. You typically have complaints: we do not have enough IOPS. We have tried to run databases on hard disks with some journaling thrown in, and obviously the performance was not quite where we wanted it to be. But a lot of things have changed in the meantime. First of all, monitoring and alerting originally worked fairly okay with hard disks, but the amount of monitoring data that we collected back then was relatively small, and it has grown significantly.
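The averaging approach described above can be sketched in a few lines. The instance list and its flavor numbers here are invented for illustration; the point is just summing each resource and dividing by the instance count:

```python
# Rough workload-averaging sketch: add up all instances, divide by count.
# The instance list below is made up for illustration.
instances = [
    {"vcpus": 2, "ram_gb": 2, "disk_gb": 40},   # small Linux VM
    {"vcpus": 4, "ram_gb": 4, "disk_gb": 80},   # typical Linux VM
    {"vcpus": 8, "ram_gb": 8, "disk_gb": 220},  # Windows VM, storage-heavy
    {"vcpus": 2, "ram_gb": 2, "disk_gb": 60},
]

n = len(instances)
avg = {key: sum(i[key] for i in instances) / n for key in instances[0]}
print(avg)  # the "average workload" used as a sizing basis
```

With these made-up numbers the average comes out to 4 vCPUs, 4 GB RAM, and 100 GB of disk, matching the example workload used in the talk.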
And it has grown to the point where it's simply not fast enough to write to those hard disks anymore. So the first thing you would need to do, if you were to keep your old hardware, is add SSDs to your control plane nodes so you would actually be able to monitor the whole thing. The second thing is that system devices should be SSDs. This is also something where initially everyone says, yeah, it boots fine from hard disk. But given the downtimes needed, especially if you do something like an upgrade where you have a lot of nodes that need to be rebooted, you will figure out that the performance of hard disks is really hampering you in the time it takes for nodes to come online again. So you have longer maintenance windows, and overall it's miserable trying to get all those nodes upgraded and rebooted.

The next thing, of course, is that the applications written five, six, seven years ago were less I/O-hungry than the applications you have today. You have a never-ending deluge of data rolling in, and you need something to actually store it on. And then we have also had some technical changes. One major one that a lot of you have probably already seen is that since Ceph Luminous, we have switched from FileStore to BlueStore, which has changed the hardware requirements for the Ceph cluster.

One more thing, of course: we had a cloud with no failure domains. So we have to actually rebuild it with proper failure domain separation, or rather, we should rebuild it with proper failure domain separation. But that's kind of problematic if you try to do it with a cloud that already exists, especially with the Ceph cluster. You say, okay, I have a soup of seven nodes and now I suddenly separate those into three racks. Then an enormous amount of data is going to start to move, and it will take quite a while.
And with both your source disks and destination disks being hard disks, in this case, if you were to keep your hard-disk-based cluster, this is going to take a very long time. Just as an illustration: we had a customer who tried to rebuild one of those 20-disk, two-terabyte-per-disk servers, alongside a server that was SSD-based. The SSD-based server rebuild took, I think, about 40 minutes to an hour to rebuild the OSD server, and the hard-disk-based server was still not rebuilt after four days.

And there's one final point. If you were to put new nodes into your existing cloud, you cannot buy the same nodes anymore that you had five or six years ago, because nobody is selling nodes with E5 CPUs anymore. What you can obviously do, especially for compute nodes, is find something with a fairly similar performance profile and just put it in. But it is not an optimal solution.

So technically we have three options; I only put two in the slides. The first is that we simply stay with what we have and live with whatever is in there. It's cheap; you can just upgrade OpenStack. You deal with performance issues, and you still have no failure domains properly set up. But you will not have to migrate workloads, which in itself is some effort. The second solution is the exact opposite of that: you buy an all-new environment, migrate your applications, and you can build the new cloud just as you want it. You do not have time pressure, because you are building the new environment while the old one is still running, and then you migrate everything over bit by bit. The downside is it costs money. The other thing is that workload migration is required. We have some customers whose workloads can be migrated very easily because they are short-lived. But other customers have long-running workloads that run for months or longer. So you may have to drag some data and some applications from your old cloud into the new one; you cannot just migrate by attrition.
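That rebuild anecdote is easy to sanity-check with a back-of-envelope calculation. The recovery throughput figures below are my own rough assumptions (random-I/O-bound HDD recovery versus many parallel SSDs), not measured values from the customer case:

```python
# Back-of-envelope Ceph OSD-node rebuild time: data to re-replicate
# divided by effective recovery throughput. All figures are assumptions.
def rebuild_hours(data_tb, recovery_mb_s):
    return data_tb * 1e6 / recovery_mb_s / 3600  # TB -> MB, then s -> h

data_tb = 20 * 2 * 0.5        # 20 x 2 TB OSDs, assume ~50% full -> 20 TB

# HDDs doing scattered recovery writes manage only a fraction of their
# sequential speed; SSDs largely keep theirs and work in parallel.
hdd_hours = rebuild_hours(data_tb, recovery_mb_s=60)    # random-IO bound
ssd_hours = rebuild_hours(data_tb, recovery_mb_s=5000)  # parallel SSDs

print(f"HDD: {hdd_hours:.0f} h, SSD: {ssd_hours:.1f} h")
```

Under those assumptions the HDD rebuild lands around 93 hours, roughly the four days observed, while the SSD rebuild lands around an hour.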
And then, finally, your CFO is going to tell you: hey, I bought this thing only six years ago, why do you already want new hardware? But, well, that's what we're looking at. And then there's the third option, which would be to reuse some of the components from the old cluster and build a new cluster around them. One of the potential candidates would be to just reuse the compute nodes: pull them out, put SSDs into them, and rebuild them. The Ceph cluster, not so much; we'll come to that in a minute. But I would personally say this tends to combine the disadvantages of both solutions. You still have to migrate, or rebuild, some of your environment, it is still going to cost quite a bit of money, and it potentially makes more sense to build something entirely new.

So, time flies. What has changed? If you look at the CPUs that were in servers about five or six years ago, like the E5-2650 I was talking about, these were relatively old-style CPUs. Intel had no competition; they built basically what they wanted, and progress was relatively slow. If you look at E5 v1, v2, v3, v4, it was always just a small incremental step. Then AMD showed up on the scene with the EPYC CPUs, and all of a sudden Intel was not so happy anymore, because the performance of the AMD CPUs was dramatically better. So what did they do? They pushed their CPU boundaries too, and quite successfully, actually. If you look at the Xeon Silver 4310, which is literally the bottom end of what I consider a professional, commercial-grade CPU (Xeon Bronze is not really something I would put in a server that I run), this is about the cheapest server CPU that Intel sells, and it's about 50% faster by benchmark than an E5-2699 v4, which back then was a $9,000 CPU.
Memory cost has also gone down quite considerably, and one of the most important changes is that 25-gigabit Ethernet is pretty much the norm these days. So the network bottlenecking that we did see quite a bit in 10-gigabit environments, especially with SSDs, is sort of a thing of the past. You can also go faster, and the cost is not so egregious that it's impossible. But in most cases, building servers so large that they cannot run on 25-gigabit Ethernet actually creates another set of problems, around packing too many workloads into one server. If you have too few compute nodes with too many workloads on them, then the failure of a single compute node will impact your environment a lot. So this is something you should probably think about avoiding. And of course, hard disks are almost completely obsolete. If you build a Ceph cluster from those two-terabyte hard disks with the appropriate SSDs thrown in, it's not really going to be that much cheaper than a cluster that is all SSD; within 10% or so, I mean.

So let's look at the modern configuration. Control plane nodes: one thing I have lately been speccing more is a single-CPU system. Then you do not have NUMA latency, and you don't have network cards attached to one or the other CPU. The worst offender on this is, by the way, Ceph, not the control plane, because in the worst case you'll have an NVMe device attached to a different CPU than the network card. So you're basically bouncing back and forth across the NUMA boundary, and you will see this in the latency of the system. Compute nodes: relatively cheap CPUs make for cheap compute nodes, and they are still better from a performance standpoint than what you had in your old system. You would need about 30 compute nodes to get the same performance profile that you got out of 50 compute nodes with the old CPUs.
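The 50-to-30 node consolidation above is just a ratio of per-node benchmark scores. A minimal sketch, with the relative score (~1.7x per new node) assumed rather than taken from any specific benchmark run:

```python
import math

# Rough consolidation estimate: how many new compute nodes replace the
# old fleet, based on relative per-node benchmark scores (assumed).
old_nodes = 50
old_node_score = 1.0   # normalize an old dual E5-2650 v4 node to 1.0
new_node_score = 1.7   # assume a new node benchmarks ~1.7x faster

new_nodes = math.ceil(old_nodes * old_node_score / new_node_score)
print(new_nodes)  # new nodes needed for the same aggregate compute
```

With a 1.7x per-node speedup this rounds up to 30 nodes, the figure used in the talk; plug in your own benchmark results to get a number for your fleet.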
Maybe 25. One of the reasons why I don't want to go much further with the CPU is that if I have to pump up the memory too much, you move away from the more or less flat pricing of 8-, 16-, and 32-gigabyte memory modules. If you go higher, to 128-gigabyte modules for instance, they are considerably more expensive per gigabyte than the smaller modules. So keep the nodes in the middle. You could compact this whole thing into a much smaller cluster if you really wanted to, by stuffing in a couple of big 64-core AMD CPUs; in the end it works. The downside, again, is that if one of those nodes fails, you are going to be in hot water.

And then we have 18 storage nodes, which also use a single CPU each. If you're buying something with a high core count at the moment, AMD is superior: Intel tops out at, what, 36 cores right now, while AMD goes all the way up to 64 cores per CPU. So this is something that should be considered. And, from comparisons I did for a number of customers lately, NVMe is simply at the same price point now as SAS SSDs. So I tell customers to just skip SAS SSDs. Most of you probably know, but for everyone who does not know the difference between NVMe and SAS or SATA SSDs: NVMe is directly connected to the CPU via the PCI Express bus, so the CPU basically writes directly to the drive. With SATA, the CPU sends the data to the SATA controller, the SATA controller translates from a memory access protocol to a disk access protocol and pushes the data to the disk, the disk itself translates it back, and the same thing happens again on the way out. So a SATA SSD is limited to about 500 megabytes per second, while the newest generation of NVMe can do five and a half gigabytes per second. It does not really make sense to save a few percent there. You will certainly find that you can never really have too much performance for the same amount of money. So now the question is: what do we do?
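To put those interface limits in concrete terms, here is what they mean for moving a single 100 GB volume, using the sequential best-case figures just mentioned (real workloads with small random I/O will differ):

```python
# Transfer-time comparison for one 100 GB volume at the interface
# limits mentioned above (sequential best case, rough figures).
def seconds_to_copy(size_gb, throughput_mb_s):
    return size_gb * 1000 / throughput_mb_s

sata_s = seconds_to_copy(100, 500)    # SATA SSD, ~500 MB/s bus limit
nvme_s = seconds_to_copy(100, 5500)   # newest-gen NVMe, ~5.5 GB/s

print(f"SATA: {sata_s:.0f} s, NVMe: {nvme_s:.0f} s")
```

That is 200 seconds on SATA versus under 20 seconds on NVMe for the same drive cost class, which is the "don't save a few percent there" argument in numbers.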
Do we buy, or do we not buy? If your cloud is still not fully loaded, you can try to just patch it and upgrade what you have. But in most cases, at the moment, if I were to build something new, I would want something that is state of the art in every respect, fully tested and fully built, and I would not want to sit in a bunch of maintenance windows at night trying to substitute new compute nodes for old compute nodes, or even new storage nodes for old storage nodes. So in many cases it simply is the better choice to buy. But in the end, this is all up to you, because you are the ones in your company who know the most about this, and you are the ones advising management and your peers on what should be done. So look at what's available, make a plan, decide what you would like to do, and then go to management and tell them: okay, this is the solution that I want. The summary of all this is: technology really has moved on in the last five years, more so than in the ten years before, and what you could get back then, especially on the storage side, is simply not in the same performance realm anymore.

So the question is, now you have a whole bunch of nodes lying around, what do you do with them? The very first thing I would do, if you do not already have one, is build a staging cloud. Build something that is small but looks exactly like the cloud that you have. Ideally you have enough compute nodes to build failure domains, and enough Ceph nodes to build failure domains: say three compute nodes, six Ceph nodes, and a control plane on top of that. Then test everything that you want to do with the new cloud on that staging cloud. If you have a staging cloud, the operational risk of your cloud goes down by more than half.
You could also make a whole bunch of developers really happy by building a lab or test/dev system, which has lower demands on the hardware and where something that is not entirely optimal is really not much of a problem. So you would be able to build software properly and only deploy it on the main system once you have tested it. If you do that with a proper CI/CD system, or any kind of automated software delivery mechanism, you are also going to cut your operational risk by, let's say, half, simply because testing in production always carries the risk that something goes horribly wrong, and then you are going to be the one who shows up at three in the morning trying to fix it. Finally, there are always trade-in programs. Unfortunately, servers from five or six years ago, for good reason, do not fetch a lot of money anymore, but at least they will get properly recycled.

So, from my point of view, this is the end of the session. From your point of view, it's more the beginning of thinking about what you are going to do in your next upgrade cycle, and how you are ideally going to address the problems you are going to face. I hope this session has given you some insight into what to think about when you're building a new environment, when you are upgrading from one OpenStack version to another, or when you're rebuilding something old that you have. So thank you very much for coming, and I hope you have a great rest of the show.