Hi, everyone. Welcome to our session about OpenStack Telemetry today. I'm Julien, this is Alex, and today we're going to talk about telemetry and the 10,000 instances. That's the title of our talk, which, to be honest with you, we had to rename. It's actually 5,000, which is good enough, but at least we tried. So I'm Julien, I've worked on this for a couple of years now. I've been doing free software for a long time, I've worked on OpenStack Telemetry for five years now, and I've been the PTL for Telemetry for a year. Alex?

Yeah, and I'm Alex Krzos. I'm a senior performance engineer at Red Hat. I've been working on OpenStack for about a year and a half now, but I've been at Red Hat for about three and a half years, working on software performance on various products, so that's pretty much me.

So in this talk, first I'm going to introduce you to what OpenStack Telemetry is, because it moves fast and not everybody is up to date on what it does now. I'll talk briefly about the architecture of telemetry in general, how modular it is and how it's supposed to work. Alex will talk about the work he did with his team on scale and performance for telemetry. Then we'll come to the results and what we discovered, I'll talk about how that influenced our development for the next cycles, and we'll wrap up with some conclusions at the end.

So first, what is Telemetry? It's a long topic, but if you're not aware, we're one of the projects in OpenStack, and it's split into several sub-projects. The most famous one, I guess, is Ceilometer, which has existed for five years now. Ceilometer has changed a lot in those five years. Nowadays, what it mainly does is poll data out of OpenStack (Nova, Glance, Cinder, whatever), transform those metrics into what we call samples, and push them into another system. Our default backend for the last two cycles has been Gnocchi, a time series database that we created. The parts you may have used before, like the Ceilometer API, have largely been split out into two other projects. The first is Aodh, spelled with an A, like the letter. Aodh is the alarm evaluation engine; it was the Ceilometer alarming code that we took out into its own project. It used to support the Ceilometer API, but that has been deprecated, so now it only supports Gnocchi; it's the alarm engine used by Heat, for example, if you do auto-scaling. Panko is our last piece, split out of Ceilometer last year; it's the API that lets you manage events and notifications in OpenStack. If you want to retrieve events, like when an instance is booted in Nova, you can ask Panko for a timeline, basically, of everything that happens in your cloud; all the data in Panko is fed by Ceilometer itself. Our last project, Gnocchi, which was started three years ago, just left Telemetry a couple of months ago to live its own life in the wild, because it's more generic than Ceilometer, which is pretty OpenStack-specific. Gnocchi is agnostic about who sends data into it, so it works perfectly well with OpenStack; it supports Ceilometer, Keystone, the OpenStack ecosystem basically, but it's more independent, so we moved it out, and that's one of the things we tested in this setup.

So this is the overall telemetry architecture. You don't have to deploy everything if you don't want to; if you don't use events, you don't have to deploy Panko, and that's not a problem.
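To make that data flow a bit more concrete, here is a minimal sketch of how you could poke at what Ceilometer has pushed into Gnocchi, assuming the `gnocchi` CLI from python-gnocchiclient is installed, your OpenStack credentials are loaded in the environment, and your Gnocchi version supports JSON output; the resource type, metric name, and field names are illustrative and may differ by version.

```python
import json
import subprocess


def gnocchi(*args):
    """Run a gnocchi CLI command and return its parsed JSON output."""
    out = subprocess.check_output(("gnocchi",) + args + ("-f", "json"))
    return json.loads(out)


# Instances that Ceilometer has registered as Gnocchi resources.
instances = gnocchi("resource", "list", "--type", "instance")
print("instances known to Gnocchi:", len(instances))

# Latest aggregated points for one metric on the first instance; the exact
# metric names depend on what Ceilometer is configured to poll.
if instances:
    measures = gnocchi("measures", "show", "cpu_util",
                       "--resource-id", instances[0]["id"])
    for point in measures[-5:]:
        print(point)  # each point carries a timestamp, granularity and value
```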
What we tested here is what I've highlighted in green: Ceilometer, the part polling the data, which is composed of agents polling the data and feeding Gnocchi, either through the collector or not, depending on whether you use it; and Gnocchi, with its metric API and the storage behind it. Panko and Aodh were not tested here. It's not that it isn't interesting to test them at scale, it really is, but we had to concentrate on some parts, and that was the main part in this OSP setup.

So I basically conducted all the scale and performance testing of Gnocchi and Ceilometer in Ocata here, and our main goal was to try to get to 10,000 instances. As we work towards that, if we find any bottlenecks, we're going to try to figure out what those bottlenecks are and what's preventing scaling. We're going to look at all the system resources, and we're also going to try to measure the responsiveness of the service. If you're not sure what scale testing is, I like to make this general comparison: it's kind of like testing a toaster. You might toast one piece of bread; what if I take the whole loaf and put it in the toaster? That's kind of what I do.

So we had various workloads we tried here, each targeting different things. One of them was just booting little instances, 500 or 1,000 at a time, letting the environment process so we can see how the backend is performing, and then doing that again, making this stepped graph until we see it start to fail. We do the same thing, but with an actual network on the instance as well. With Gnocchi and with Ceilometer, the more objects you have inside your cloud, the more metrics you're going to have. So that's another good point: don't just focus on the instance count. If you have instances, you also have images, volumes, and networks, so you're going to have other metrics too. Of course, we can't create a giant matrix that's infinite in size (500 instances, 500 volumes, 500 networks, and so on); we try to cover as much as we can, but it's too much to do all of that. The last thing is that we try to measure the API's responsiveness as well, because if it doesn't respond quickly, then even if you get 10,000 instances in there, it's not really functional.

So here's the hardware I was able to use. Everybody likes to look at hardware and know what it is, so I wanted to share that. You can see that the Ceph nodes had a lot of disks in there; we did have an NVMe drive for the Ceph journal. On the compute nodes, you'll notice that some of them had 128 GB of memory and some had 64 GB. That kind of bit me later on; you'll see. Here's the topology. The main thing about the topology is that there are two NICs, both 10 Gb. You can see that we're using TripleO, so we have an undercloud and an overcloud. The undercloud is not part of the overcloud; it's not the cloud under test. It is what I used to orchestrate my workload, though: I would hit the APIs from the undercloud to create the workload on the overcloud.

So, the 10,000-instance test. I'm just going to be quite honest here: that was not the first thing I did when I had all this hardware, because you're bound to run into failure. You can see that I ran a number of tests to start developing all these different tunings to try to get there. So for this test, we did 500 instances every hour; a rough sketch of that kind of stepped driver follows.
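As an illustration of that stepped workload, here is a minimal sketch of the kind of driver you could run from the undercloud, assuming the `openstack` CLI is installed and overcloud credentials are loaded in the environment. The flavor, image, and network names, the step size, and the one-hour soak are placeholders, not the actual tooling used for these tests.

```python
import subprocess
import time

STEP = 500    # instances booted per step
STEPS = 20    # 20 x 500 = 10,000-instance target
SOAK = 3600   # seconds to let Ceilometer/Gnocchi catch up between steps


def boot_instances(count, step):
    """Boot `count` small instances via the openstack CLI."""
    for i in range(count):
        subprocess.check_call([
            "openstack", "server", "create",
            "--flavor", "m1.xtiny",    # placeholder tiny flavor
            "--image", "cirros",       # placeholder image
            "--network", "scale-net",  # drop this for the no-network variant
            f"scale-{step:02d}-{i:04d}",
        ])
    # Real tooling would also poll the build status and check for errors here.


for step in range(STEPS):
    boot_instances(STEP, step)
    time.sleep(SOAK)  # give the telemetry backend a full hour to process
```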
So I let it have a whole hour to try to process. You can see some of the tunings there: the number of workers, the reduced processing delay, some modifications to Ceilometer (publishing directly to Gnocchi, not using the Ceilometer collector), and setting a prefetch on RabbitMQ. I have graphs; if you love graphs, this is the perfect presentation for you. I'm kind of known as Mr. Dashboard, so you'll see that in a moment. The other big thing to mention, really, is the patches. As we ran into problems, it became apparent that we needed some patches, and I could not wait for a build to come out, so I had to hot monkey patch the cloud, as most people would call it, at that point.

So here are the basic results. This graph really highlights a lot of what occurred. You can see the stepped growth of instances; I was doing 500 each time, and it's going up. At the same time (and I apologize if these numbers are small, I couldn't predict how it would look on the screen beforehand) I graphed the actual backlog of Gnocchi: what are the measures and metrics, and how quickly is that getting reduced and processed, so I could see if it's keeping up. If that goes down to zero, I know it's keeping up within the polling interval. I also threw in there, because we ran into some issues with Ceph, the total number of Ceph objects. One thing to note on that graph is that it's divided by 10, because it was going to throw my axis off so much, so multiply that number by 10. That really becomes just under 4 million objects in there. That's way too many, but we'll get into that.

So here are the Ceph objects in more detail. This one isn't reduced by 10, so the axis, which is probably too small to read, goes up to millions there. The other thing I want to mention here is that we tested Gnocchi with the Ceph storage backend; obviously, I had the Ceph nodes there. So we're storing the processed metric data there, but the backlog of unprocessed data is being stored in Ceph as well. What we found is that when Gnocchi can no longer keep up, you'll see an explosion in small objects inside Ceph, and Ceph FileStore does not do well with small objects.

So here's the instance distribution. This also became a bit of a problem for us, but in the 10k test I got it evened out as best as I could. This is literally every single compute node, the 30 of them or so, graphed there, showing whether the number of instances was even across them. You can see that we generally got it even; it's not perfect.

This is what the CPU looked like on all the controllers. Like I said, I'm Mr. Dashboard, so I love to put up as many graphs as I can. I like to see that the CPU is being distributed across all three of my controllers. Gnocchi is running on each of these controllers, and we also have Ceilometer, Nova, Neutron, Glance: all of the OpenStack services are running across them. You can note that the large spikes in CPU utilization happen whenever the polling occurs. Now, I did do something that is counterintuitive there, which is that I synchronized most of my polling to happen at exactly the same time, because when I have a thousand instances booted,
I want to see all of the metrics for those thousand instances show up as fast as possible and then see how long it takes to process them. In a real production environment, you'd put some jitter and some spread in there so you wouldn't get this thundering herd. You can also see the spikes in CPU utilization every time I was booting instances.

So this is the total memory across all the hosts. Whenever I do any performance analysis, I like to go through the food groups of performance: I look at the CPU, the memory, the disk, and the network, and I try to find out where the bottleneck is that I'm going to run into. Hopefully there is no bottleneck, but I'm never that lucky. What you see there is actually all three controllers, where I just add the memory together and make a stacked graph out of it. Really, the biggest thing I was looking for here was to make sure I didn't exhaust my compute nodes, because there's no swap space on those machines, and once I run out of memory, the Linux OOM killer is going to come out and start killing things.

Here's what disk utilization looked like on the controllers. The one odd thing about this graph is that the top one, controller zero, has a slightly higher axis on percentage of disk I/O time used. What happened is that controller zero is the master MariaDB node, so it has the most disk activity, which is then replicated to the other controllers.

Here are the disks on the Ceph storage. Remember that there are 12 nodes here, but I could not put up graphs for all 12 nodes because it's just too many, so this is the first three. You can see that at the very end of the test things really started going haywire with some of these disks: their percent utilization started going a little higher. It's really not that dramatic; the percentage of utilization here is pretty low. One of the things I found is that if you have multiple archive policy definitions in Gnocchi, you will see higher disk I/O utilization. There was another thing I wanted to point out here, but I can't think of it; I'll come back to it. Let's go to the next slide, though.

So this is the network. This is em1; if you remember from the topology diagram, em1 was carrying the storage management and storage networks. You can see where it approximately broke down, around 5,000 to 5,500 instances. Once I went to 5,500 instances, that's when we saw the breakdown in the measures and metrics, and the traffic pattern really started to change at that point in time: network I/O became more sporadic, at least on the storage side. If we look at the internal API network, it stayed pretty consistent the entire time.

All right, so let me go into the API responsiveness test. For this one, I was doing instances with a network, 500 instances every 30 minutes, and after the 30 minutes I would run my API responsiveness benchmarks. These are some of the tunings I had at that point in time; this obviously wasn't the last test, the 10k test. So here's the performance I found with get measures as I bumped the instance count up. I only went to 5,000 on this particular test; the sketch below shows roughly how you can time calls like that and pull out percentiles.
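This is a minimal sketch of timing Gnocchi get-measures calls and extracting a 95th percentile, assuming you already have a Keystone token and the Gnocchi endpoint; the endpoint, token, metric UUID, and request count are placeholders, and this is not the actual benchmark harness behind the graphs.

```python
import statistics
import time

import requests

GNOCCHI = "http://controller:8041"   # placeholder Gnocchi endpoint
TOKEN = "..."                        # placeholder Keystone token
METRIC = "2b7776e1-0000-0000-0000-000000000000"  # placeholder metric UUID
N = 200                              # requests per sample

latencies = []
for _ in range(N):
    start = time.monotonic()
    resp = requests.get(f"{GNOCCHI}/v1/metric/{METRIC}/measures",
                        headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    latencies.append(time.monotonic() - start)

# 95% of the requests completed at or under this latency.
p95 = statistics.quantiles(latencies, n=100)[94]
print(f"min={min(latencies):.3f}s p95={p95:.3f}s max={max(latencies):.3f}s")
```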
You will notice that I don't have the max graphed on there, because the max would throw off the axes pretty hard on this one, but you can see that it performed fairly well at the 95th percentile, meaning 95% of all requests fell underneath that line, so I'm not really disappointed with that.

Here are create metrics and delete metrics; I ran that benchmark as well. You'll see that the pattern we're seeing there really changed after 2,500, and there's a little blip on the creates at 1,500. What actually happened there is that, to some degree, we started having collisions on the API with the polling. When the polling fires off, the notification agent starts collecting all the data and then pushes it to Gnocchi, and publishing it to Gnocchi means making HTTP requests. So when it hit the API while I was running my benchmarks at the same time, I was competing with the polled metrics being pushed in. That became a bit of a problem, as you can see here, and as the number of instances grew, so did the window of time during which data was being published to Gnocchi's API, and it became much more likely for me to hit that. Here's a slightly more detailed view of that so you can see what happened: polling happened around here, posting of the new data continued throughout this period of time, and that caused larger spikes and more variability in the responsiveness. That's for create and delete metrics. Here's create and delete resources. Due to the way I had structured the test and the timing, this one didn't get affected until later. It's the same artifact we saw with create and delete metrics; after around 4,000, I started having that same bad timing occur.

So, tuning; I wanted to talk about that. You can add more metricd workers. Depending on your deployment tool, you really want to know how many processes are deployed across your controllers. More workers equals more capacity; you just have to remember that also means more memory, and if they're consuming CPU, you have to look at the CPU on your controllers. If you're going above your physical number of cores, you're probably competing with other processes, and at that point you're not really going to get higher capacity. There's also the metric processing delay: there's a delay built into Gnocchi that allows some accumulation of metrics and measures so it isn't constantly using the CPU, and we had to find a balance there. If you reduce that delay, you'll have greater capacity, but at the expense of CPU and I/O, and I/O in our case was I/O on the Ceph storage. The other thing you really have to keep in mind is that if you increase those workers, you're going to have more connections into your database, and if your database is HA and you have HAProxy in front of it, you had better check HAProxy too, because you don't want to find out what it's like when you hit max connections there.

For Ceilometer, here are some of the other tunings you can do. You can publish directly to Gnocchi: you remove the notifier line from your pipeline.yaml, and what you're doing there is removing the collector from the path, so you no longer need ceilometer-collector and can turn it off. A sketch of that pipeline change follows.
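As a sketch of that change, the snippet below rewrites the publishers in Ceilometer's pipeline.yaml to the direct gnocchi:// publisher instead of the notifier (which feeds the collector over RabbitMQ). It assumes the stock file layout and PyYAML installed; treat it as an illustration, and back up and review the file rather than running something like this blindly.

```python
import yaml

PIPELINE = "/etc/ceilometer/pipeline.yaml"

with open(PIPELINE) as f:
    pipeline = yaml.safe_load(f)

# Point every sink straight at the Gnocchi API instead of the notifier queue
# that ceilometer-collector would otherwise consume from.
for sink in pipeline.get("sinks", []):
    sink["publishers"] = ["gnocchi://"]

with open(PIPELINE, "w") as f:
    yaml.safe_dump(pipeline, f, default_flow_style=False)
```

After a change like this the notification agents post samples to Gnocchi themselves, which is why the collector processes can be stopped.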
You'll save something like 150 megabytes of memory that way by turning it off. I've also found that the collector really just introduces yet another queue: if you keep the collector alive, the polling grabs metrics and throws them into RabbitMQ, the notification agent grabs them, does some processing, and puts them back into RabbitMQ again, and then the collector picks those up and publishes them to Gnocchi. So why bother having the collector? The collector has already been removed from master at this point. You definitely want to set your RabbitMQ prefetch; in my clouds I found it set to zero, which means unlimited, and if you start to overwhelm the service, unlimited means it's going to grab as much as it can and just explode in memory. I have a graph that'll show you that. The default archive policy also has fewer definitions: of the different archive policies shown there, the default "low" one only has one definition, so that is less workload.

Apache: I had to do some tuning with Apache, because in my particular cloud the Gnocchi API is deployed in Apache and we were running the prefork multi-processing module. This was really before we dug deeper into Gnocchi and found that the way we were pushing new measures into the backlog was slow; I tried to circumvent that by just giving more resources to the Gnocchi API, and when that didn't work, we had to dig a little deeper and figure out exactly what was going on. But these are some of the tunings you're really going to want to think about with Apache: your MaxClients and ServerLimit (Apache with prefork typically deploys with 256), how many servers you start with on startup, your minimum spare servers, your maximum spare servers, and maximum connections. The last thing I really want to mention and stress here is that you have to be careful planning these values in Apache, especially when you're hosting multiple services. If one service goes haywire and takes all of the slots in Apache, and let's say you have Keystone in there, well, guess what: you're not going to get a token anymore, and if you can't get a token, you basically can't do anything in your cloud.

So here are some of the other issues. The main issues we identified from this testing: the single Ceph object for the backlog; there's already a patch out there, and Julien can talk all about the patches he made and worked on for this, so there isn't going to be a single Ceph object for the backlog anymore. Many small Ceph objects; that one's a little more difficult. There's also the Gnocchi API being slow at posting new measures. Apache: I had some thrashing going on in there. Gnocchi could lose the block it works on; I have a graph on that. The connection pool: we've seen that get full, because urllib3 only allows 15 connections and we have 64 executors inside Ceilometer, so they're all fighting for those 15 connections. Ceilometer also ran into prefetching too many messages.

So here are some of those graphs. The slow API I was talking about: the original implementation is a threaded implementation. What would happen is that new data was posted into the Gnocchi API and then split out among a number of threads, and the number of threads was based on the CPU count.
Each one of those threads would then append a key into an OMAP object in Ceph, and that just turned out to be totally haywire. You can see there on the axes that this goes up to minutes on how long it was taking, and I highlighted the max, average, and minimum latency so you can see the spread. Then, if you look after I put a patch in for batching, you can see how much we reduced that: it went down to seconds. This graph doesn't even go beyond about 4.5 seconds.

So here's what I wanted to demonstrate with the Apache thrashing, and that is basically the threaded API versus the batched API. You can see that we're constantly creating a whole bunch of processes and then slowly they're getting killed off by Apache, because the minimum spare servers setting was too low. What I did to help with that was give it a whole lot of slots and keep a lot of them alive at the same time.

So here's the lost block. If you look at it you'll see where it happened: right there, all of a sudden, I'm no longer returning to zero, or not returning to zero as much. The instance count was static at a thousand instances, and this is the backlog: we throw a whole bunch of measures on there and they get processed down to zero, so I know all my measures and metrics got processed. You can also see how Gnocchi stores its backlog in Ceph, as small objects plus the keys appended to that single object that helps track the backlog. Once it lost the block, I lost capacity, because whenever I'm at zero I have headroom: I can throw more instances in and it's going to process them.

Here's the slow status API. One of the ways to monitor Gnocchi is to check its status: you can just run gnocchi status from your terminal and see how many measures and metrics are in your backlog (a small watch loop along those lines is sketched below). Well, at one point, when this test bed essentially crashed, when I lost the capacity to handle everything, you can see these dots, which are the actual max latency for the status API. That creates these artifacts in my graph where I'm not getting the data, so Grafana is essentially drawing straight lines between whatever data points I have. That's certainly something I could fix at the visualization layer, but it would be incorrect to read that flat line right there literally; that's probably not what was exactly going on, since more measures were being posted, so it had to go up.

So here's what happens with the unlimited prefetch. This is the memory on my controller. You can see that at one point I booted too many instances, and at that point, boom, all of a sudden the memory growth takes off. So I had to dig through my graphs, figure out who's responsible, play the blame game, and you can see that the ceilometer-collector grew massively: 37 GB of resident memory, that's a lot.

So, other various issues. This was almost an epic two-week story of issues we ran through and different things you have to tune. Just getting that number of instances onto that many computes definitely presents some challenges. virtlogd: if you create too many instances on a single compute, eventually you're going to run out of file descriptors on virtlogd. I also found difficulty in distributing small instances evenly.
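Picking the gnocchi status idea back up, here is a minimal sketch of a loop you could run to watch whether the backlog drains back to zero within the polling interval, assuming the gnocchi CLI is available and supports JSON output; the exact field names in the status report vary between Gnocchi versions, so the sketch just dumps whatever it gets.

```python
import json
import subprocess
import time

INTERVAL = 60  # seconds between samples

while True:
    out = subprocess.check_output(["gnocchi", "status", "-f", "json"])
    status = json.loads(out)
    # The report includes the number of metrics and measures still waiting
    # to be processed by metricd; key names depend on the Gnocchi version.
    print(time.strftime("%H:%M:%S"), json.dumps(status))
    time.sleep(INTERVAL)
```

If the reported backlog keeps climbing across polling cycles instead of draining to zero, that is the same loss of capacity described above.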
At times I was able to schedule more instances than the max instances per host limit. Overhead memory looked like it was becoming an issue. We had uneven memory on some nodes, and we even discovered SMIs; if you're into performance, you might know about those.

So here's the instance distribution when you look at it with the virtlogd issue. You can see that essentially I hit the max limit of 252 on each of these guys. I don't know what happened to these two; they just did not get as high. But at that point I could not boot any more instances, I'd just get an error about too many file descriptors.

Here's an example of getting more than 350 instances scheduled. You can see those top four that I have graphed right there: they're above 350. So even though I set that limit and I set the NumInstancesFilter, I was able to get more scheduled there. I need to do a little more research to figure out exactly how that happened, though I suspect that because I have three controllers and three Nova schedulers, there's a potential race condition in there.

Uneven memory: this is what happens due to uneven memory. You can see one compute here has 64 GB of memory and one has 128 GB. If you don't do any tuning of how Nova selects hosts for instances, you're going to end up with a distribution that looks like that, and you can see that two of the nodes with 128 GB just got clobbered with instances, upwards of 700 on there, versus the other ones only having 90 or so. If you set the RAM weight multiplier to 0, you'll get a better distribution.

Here's an example of the overhead memory issue I was running into. I was using a very small flavor, which I called m1.xtiny, and that's 64 megabytes of memory. But you can see here the number of instances graphed over the memory utilization. It pretty much follows that stepped pattern, and you can see that at some point memory gets reclaimed as well. I'll have to look into exactly what occurred there, but at this point I was thinking I wasn't even going to make it to the number of instances I wanted with the compute nodes I had.

SMIs: if you haven't heard of these, you need to start thinking about looking at the BIOS settings on your servers. If the power settings are handed over to something like the BMC, the BMC is going to control the power, and to do that it needs some CPU: it's going to wake up and take CPU away from your operating system without you even knowing it. To demonstrate that, this is the first time I've visually been able to see how problematic this is. This bottom node was seeing something like 480 SMIs every 10 seconds, and you can see the CPU utilization is higher on that one versus the one on top. They both have exactly the same number of instances booted on them, but this one is using a heck of a lot more CPU even though those instances are idle. As soon as we changed that BIOS setting to OS control, they both matched each other exactly. So if you want deterministic performance and deterministic results, you're going to need to hunt these things out of your test bed.

And here's another one: one Ceph node decides to have higher disk utilization on its root disk.
I have not been able to root-cause this one yet, but you can easily see that the first disk, sda, has a bit higher utilization compared to another one, even though they have basically the same workload running against them.

So, future performance and scale work: probably the next biggest thing is to really investigate metricd's processing and responsiveness. I truly believe that we can get Gnocchi above 5,000 instances and get it to fully process; I want to run this again in the future and prove to you that it can happen. Obviously we need to get there, though. We're going to need to do some Ceph tuning, probably; Ceph BlueStore looks very promising for the small objects. The other one would be isolating the ingestion of new measures from the retrieval APIs; you already saw how, at certain instance counts, I started having collisions there. And I also want to take my benchmarks and contribute them to OpenStack Rally. So, back to Julien.

Yeah, so thanks to the work of Alex, we gathered a lot of data, which is very interesting, and which we cannot gather on our own while developing, because we don't have this kind of hardware every day to run the algorithm we think is faster or more efficient at distributing workload or whatever. So it was very, very useful. We implemented a lot of things, and having a guy like Alex doing this, able to monkey patch and try things during his tests, meant we did not have to wait for feedback, write a patch, and then get it into testing again. That was pretty cool, because we were able to merge patches pretty fast into master, things like the batching of new measures and a few other things. The next version of Gnocchi, which should be released in a few weeks, I hope, will include a lot of the things we found out during these tests. One of the things we are also working on is the new work scheduling for metricd, which Gordon here is also working on, and which should improve a lot of what we discovered about metricd trying to catch up with the number of incoming measures. We also deprecated the Ceilometer collector during this cycle (Alex already did that in his deployment), so it's going away pretty soon. It turns out this simplifies the architecture and the model of Ceilometer a lot, which is a good thing, and it also lowers the usage of RabbitMQ, which always has too many things on it anyway. So everybody's happy with that.

So, to conclude, we're pretty happy with the work we've done so far, and I can recommend, if you have both a dev team and a perf team, making them work together. It was a very, very good experience for us developers, who write code all year and are not always sure we pick the right defaults or the right algorithms to schedule things or sum metrics, so it was a pretty good experience. As we showed you, we can scale up to 5K instances. We have to reach 10K in the next cycle, hopefully, and maybe we'll be there next time to show you. It's also not yet clear that we scale to millions of metrics, because, as Alex said at the beginning, when you boot up 5K instances there are tens of metrics for each instance, so that's a lot more than 5K metrics. And it's not clear to us either whether the rest of OpenStack scales that much, so we're not really worried about that.
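To put rough numbers on that, here is the back-of-the-envelope arithmetic, using the approximate figures Alex gives in the Q&A below (about 55 metrics per instance and a 1,200-second polling interval); this is just an illustration of the scale, not an exact accounting.

```python
instances = 5000
metrics_per_instance = 55      # approximate figure from the Q&A
polling_interval = 1200        # seconds (20 minutes, a billing-style interval)

instance_metrics = instances * metrics_per_instance
print(f"metrics from instances alone: {instance_metrics}")    # 275000

# Every polling cycle each metric receives at least one new measure, so the
# steady-state ingestion rate metricd has to keep up with is roughly:
measures_per_second = instance_metrics / polling_interval
print(f"~{measures_per_second:.0f} new measures per second")  # ~229/s
```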
If you have any questions, we'd be happy to answer them. One other thing I want to stress, and I don't think I stressed it when it was up there: if you're using telemetry, please try to understand what you're doing with that data and what you want out of that data. I feel like a lot of people just have it turned on in their cloud and don't really know, or never knew, what they wanted out of it. Do you want billing data? Do you want performance data? Do you want capacity-type metrics? Each of those three types of workload is going to demand maybe a different archive policy and a different polling interval. So really think this stuff out before you just start turning it on and expecting it to solve things for you without thinking about it. Any questions?

Just a quick clarification: on the last slide you had a bullet item that said that telemetry scales to 5,000 nodes easily. Are we talking about nodes or instances?

Oh, instances. Yes, that's the number.

You've alluded to that being more than 5,000 metrics. How many metrics per instance was it?

I'd have to go back and look at my notes to get the exact numbers, but it was somewhere around 55 metrics collected per instance. So you take that, multiply it by 5,000, and you'll have the number. I also had two images sitting in Glance; those each have, I think, four metrics that are being collected as well.

And what was the collection interval?

The collection interval for the largest scale test was 1,200 seconds, so 20 minutes, which is more of a billing-type use case. The data that I presented in all those graphs was collected at a 10-second interval, but that's over a large number of metrics and I use a different tooling set to do that. Actually, we're actively looking at how we can bring Gnocchi in there to do that processing, so we're going to dogfood it.

Did you do any testing with the new Ceph BlueStore backend?

I have not yet. I really want to, though, so if somebody could help me deploy that; I'm going to have to find the right Red Hatter there.

Any other questions? I think we're out of time now, right? Well, thank you, guys. Thanks, everybody.