Hi there. Thank you all for coming today. We're really excited to share our user story with you: a series of unfortunate deployments running a Lambda architecture on OpenStack. Before we get started, we're going to take a moment to tell you about ourselves. My name is Monica Rodriguez-Stankey. I am a lead DevOps engineer at CAS in Columbus, Ohio. Most of my job really consists of shepherding applications and tools through the deployment lifecycle. As you'll hear today, that often involves a lot of pain and suffering. So Chris, Scott, I'll let you introduce yourselves. Hi. My name is Scott Copeland. I'm a senior technologist at CAS. I've been there for almost 17 years now, building web-based search and retrieval applications for our customers. I'll let Chris introduce himself now. Hi. My name is Chris Brew. I am a Rackspace dedicated architect, working with CAS for about eight months now. In that time, we have pushed the envelope of OpenStack on Liberty, done some crazy things, done some awesome things, and, as we're about to tell you, done some unfortunate things. A little bit about Rackspace. We have a global footprint. We're in 150 countries with 12 data centers worldwide. We have a portfolio of hybrid, dedicated, and pure cloud. Of the 5,800 Rackers, as we call ourselves, about 3,000 are support-based, so we are heavily influenced from a support standpoint. And I'll let Scott introduce a little bit about CAS. Yeah, so to understand the story we're going to tell you here, it probably helps to understand who CAS is. You may not have heard of us. CAS is a division of the American Chemical Society, which is a congressionally chartered nonprofit institution dedicated to the advancement of the chemistry discipline and to supporting its practitioners. We are also located in Columbus, Ohio, right next to The Ohio State University. Any Buckeyes? No? OK. Well, anyway, that's who we are. Now, what we do: we are technologists, scientists, and business leaders, and we pay attention to the entire world of chemistry. Regardless of where chemical information shows up - journal articles, patents, or basically any other source, going back nearly 200 years - we've built some very rich databases that capture all of that chemistry knowledge. And it's these databases that we use to power the products and services we offer to chemists, not just in the US, but globally. We are a global solution provider. We have a variety of products, and the one we're going to focus on for this story is known as SciFinder. SciFinder is a product aimed at being easy to use for working bench chemists to explore this rich world of chemistry. But before we go specifically in the SciFinder direction, I want to take a step back and introduce a concept that we alluded to in our subtitle. It's called - go to the next slide, there we go - the Lambda architecture. Quick show of hands: who in the room has heard of the Lambda architecture? Excellent. So maybe I can go through this fairly quickly. For those who haven't heard of it, this is a term coined by Nathan Marz, I believe in 2011. It's a general description of a big data architecture designed to scale and also to provide an important characteristic called human fault tolerance. And we have a particular variant of the Lambda architecture that we run at CAS.
And you'll hear us allude to this acronym in the presentation, so I want to make sure you understand the terminology we're using. We call it our Information Access Platform, or IAP. Like any great technology organization, we love our acronyms and we attach them to everything. The IAP conforms to the Lambda architecture for the most part, and I'm going to try to go through and describe the various portions and how they map to what we're doing architecturally. The Lambda architecture starts with our master data set, what we call our content. That's that nearly 200 years of rich chemical information that we've gathered using our scientists and analysts. That master data set then goes through what we call compilation. In the Lambda architecture, this is essentially the batch layer, and it's used to produce what we call an IAP index - in Lambda architecture terms, the batch view. A key characteristic: this is an immutable view of the data, designed for serving use cases. Unsurprisingly, we use Hadoop primarily to power this. It's not the focus of this story. The interesting parts come later in the Lambda architecture, in the serving layer. This is where we load that IAP index, that immutable batch view, into an IAP search engine, and then we use that to answer the customers' queries. Unsurprisingly, given the venue, we're using OpenStack to power this. Now, we've done this architecture on physical hardware before, in a cluster computing setting. What's new is that we were bringing it to the OpenStack ecosystem, because we needed the power and flexibility to right-size our solutions as we build out our portfolio of products. From here, I want to pivot back to the SciFinder story, the SciFinder problem. How big is this data set that we're dealing with? One dimension you can measure that in is the variety of content we deal with. This isn't a single type of content. We deal with lots of different kinds, whether it's index concepts, documents, patents, chemical structures, or reactions. All told, when we sum that all up, we're loading a search engine with a property graph of over 1.1 billion nodes and 11.3 billion relationship edges. And those aren't just simple nodes. Every one of those usually has something like a Lucene searchable text index associated with it, or our own proprietary chemical structure match algorithms. So this is a really big problem. The size of the index that we compute is 7.5 terabytes, and that's too big to realistically be handled on a single machine. So we employ clusters of machines to serve that index. They're connected with high-performance 10-gigabit networking, and we have 14.4 terabytes of RAM per cluster that we largely use to load this into. We then use the rest of the memory and the CPU cores available on that cluster to serve the users' queries, with about 1.7 million threads running around, across which we employ pipeline parallelism, data parallelism, and task parallelism algorithms. And then we take that whole arrangement, replicate it multiple times, and that's how we handle the customer load. So how does this map onto the Lambda architecture and OpenStack? We like to call this the scene of the crimes. We have a fairly custom use case, and there are some moving parts in the Lambda architecture, so we started out by implementing our own orchestration. We use Apache jclouds for this.
We had a lot of Java development experience in-house, so this was a natural fit. From there, the first step in the OpenStack realm is that we allocate some compute nodes and some block storage, so we're making use of Cinder and Nova to do so. We copy the data from HDFS, our Hadoop ecosystem, onto that block storage, staging it for use by the online system. We then - hopefully - throw away those compute resources and return them to the pool. From that point, we allocate a high-performance cluster connected with a network, attach the block storage where we've staged that sharded IAP index, and start up our service so it's ready to answer customer queries. So we make use of a lot of different OpenStack technologies here.
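Our orchestration layer itself is Java on Apache jclouds, but as a rough sketch of the kind of OpenStack calls that orchestration drives for a single index shard - the flavor names, image names, network IDs, and volume size here are made up for illustration, not our real values - the flow looks roughly like this:

    # (illustrative names and sizes only)
    # Staging phase: a throwaway compute node plus block storage (Nova + Cinder)
    openstack volume create --size 900 iap-shard-01
    openstack server create --flavor staging --image ubuntu-14.04 \
      --nic net-id=STAGING_NET stage-01
    openstack server add volume stage-01 iap-shard-01
    # ...copy this shard of the IAP index from HDFS onto the volume...

    # Return the staging compute resources to the pool
    openstack server remove volume stage-01 iap-shard-01
    openstack server delete stage-01

    # Serving phase: attach the staged shard to a full-size search node
    # on the high-performance cluster network and start the service
    openstack server create --flavor search-xl --image search-engine \
      --nic net-id=CLUSTER_NET search-01
    openstack server add volume search-01 iap-shard-01

Multiply that by several volumes per instance and a full cluster of instances, and you get the API call counts that come up later in the storage story.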
The first one we're going to focus on for our story is what happens when the team encounters - drum roll - laggard networks. Thank you, Scott. So Scott mentioned our high-level process. We will deploy a cluster, and in those clusters we will stand up an application stack. In this application stack we have many virtual instances, and each of those instances has a number of separate network interfaces attached to support the different functionality of the application. So I have a few expectations when I do this deployment. I expect networks to attach, and I expect them to stay there forever, or until I say differently. And I expect the VXLANs - that's the network type we were using at the time - to perform reliably. This seems reasonable, right? Well, we thought so too, until we tried it. So what did we actually see? Networks were falling from the sky. This is where your hero, the DevOps engineer, gets the dreaded 3 AM phone call for production support. Did you know that when you're troubleshooting, checking to see whether a network interface is still attached is not typically the first thing you look for? It's probably not even in the top 10, right? You just don't expect it. If I go plug a network cable in and close the door, it's locked, it should just stay there, right? Same thing here, but virtual. And what you find is that the monitoring solution that alerted you in the first place probably doesn't tell you the actual issue. It says, hey, there's something wrong over here, but I actually have to dig through the entire application stack to find what really happened. So here we are at 3 AM, absolutely looking for a needle in the haystack. And for the networks that did hang around, they performed poorly. You can see from the chart up here that we were only getting about a third of the throughput that we had from our physical environment. So Chris, what was really happening here? Thank you, Monica. I switched it. First, let's do a little background on what a VXLAN is. At its most basic, VXLAN just means encapsulation. We're taking the packet and adding an extra header onto it so we can route it through the various network gear and the computes themselves. There were two main issues identified with the VXLAN networks. The first was that we were not able to use NIC hardware that allowed for VXLAN offloading. This meant that all of the encapsulation and decapsulation had to happen at the CPU level, and that added a huge overhead from a throughput standpoint. As Monica mentioned, we saw roughly a third of line speed. The second issue came from running VXLAN at scale. Since we are encapsulating that traffic, the address and activity databases have to live somewhere that can be seen from a VXLAN standpoint. That means they live at the compute level, and they're replicated across the cloud. So when we were running VXLAN at scale, with many hundreds of compute nodes, database entries were getting lost, instances weren't able to ping each other, corruption was happening across the board, and networks would disappear. Unfortunately, to solve this case, we actually just switched to VLAN. It removes that encapsulation layer, lets the switches handle the ARP and address tables, and it gave us the performance that we were looking for - not quite the line speed we were hoping for, but double what VXLAN could give us. So in that case, it was a story of us being unable to work around VXLAN at the scale we wanted, at least at the time, on Liberty, with the Liberty code and the OVS software we were running. To solve this particular problem, we had to remove VXLAN and go to VLAN. So we had a partial success, but also some tragedy with that particular story.
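For context, the Neutron side of that switch is mostly an ML2 configuration change. A minimal sketch, assuming a stock ML2/Open vSwitch deployment - the physical network name and VLAN range below are placeholders, not our values:

    # /etc/neutron/plugins/ml2/ml2_conf.ini  (illustrative values)
    [ml2]
    type_drivers = flat,vlan
    tenant_network_types = vlan        # previously vxlan
    mechanism_drivers = openvswitch

    [ml2_type_vlan]
    network_vlan_ranges = physnet1:200:299

    # each compute/network node then maps that physnet to a bridge, e.g.
    # [ovs]
    # bridge_mappings = physnet1:br-vlan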
The next story we're going to talk about is what happens when we spin up extremely large compute instances. Remember, the scale of our problem is too big to be solved on one machine alone, and we ideally try to scale vertically as far as we can because of distribution costs. So we're talking about very large compute here. Now, that seems like a great plan, but not everything went exactly as planned when the team encountered murderous kernels. Thanks, Scott. So we are deploying full-size instances - they fill entire compute hosts - and we deploy a cluster of them to support our application. What I expect is to be able to spawn these instances consistently and for them to perform reliably. Our search engine actually manages its own memory, so we want the hypervisor to just stay out of the way, so that we can maximize the use of those resources to support our application. That all seems pretty reasonable, right? Well, we thought so too, until we tried it. So what actually happened? The application performed poorly. So poorly, in fact, that it stopped functioning at all. We would see stuck CPUs in syslog, and we would also have other metrics showing us that the CPU was 100% busy, even though nothing was actually happening on the host. And for me as the DevOps engineer, the victim, I'd have an entire VM go into a shutoff state with no idea how it got there. So I'm stuck playing whack-a-mole trying to resurrect all these instances. This went on for weeks at a time, with the simple explanation that OpenStack is unstable. We just didn't know what was going on. And as Scott mentioned earlier, one instance going down, even though we've got a large cluster, does cause the entire application to go down. So we spent weeks battling constant downtime of this application. So Chris, what was really happening? Thank you, Monica. For murderous kernels, there are a few things we identified as root causes. The first: NUMA memory imbalances. NUMA means non-uniform memory access. It's an architecture where memory is attached to specific regions, local to specific CPUs, and processes aren't automatically kept on the region that holds their memory. From CAS's standpoint, they wanted to use all of the hypervisor - like they mentioned, all the CPUs, all the memory, they wanted it all for the instance itself. What would then happen is that, since we weren't using CPU pinning of any kind, the instance's processes would float amongst all the physical CPUs at the hypervisor level, resulting in memory imbalances. The instances getting mysteriously shut off was the out-of-memory killer going through and saying: I need to protect myself, what is the largest consumer of memory? The instance. Kill it. Dead. We didn't have any isolation between the instance and the hypervisor itself, and that resulted in processes taking too long to do their work, networks getting slowed down, all kinds of general wonkiness and instability in the environment. Without proper isolation between the two, the hypervisor didn't have enough memory, and the instance was running into contention, CPU locks, just general craziness. To solve this, we implemented CPU isolation and pinning. CPU isolation was done with a GRUB config that removes a set of CPUs from the kernel scheduler - basically, schedule the host on a small subset and reserve the rest for the instance itself. Then we used CPU pinning to dedicate one virtual CPU to one physical CPU. This solved the NUMA imbalances, it solved the resource contention, and it gave the hypervisor a dedicated set of resources that would not get in the way of the instance, and the instance would not get in the way of the hypervisor. From a NUMA standpoint, we saw that within their application they were seeing a bunch of TLB cache misses. This came down to the way we were allocating huge pages. Before, they were using the smaller huge pages, and, like they mentioned, they're trying to load eight or nine terabytes of data into as much memory as possible so they can search as fast as possible. To solve for this, we now pre-allocate one-gig huge pages. That solves some memory fragmentation and gives the application properly balanced NUMA regions that it can allocate dynamic memory from within the application. So from this standpoint, we were actually able to solve the instance instability at the hypervisor level. This is the one I think we've made the most progress on, and we consider this one a successful journey.
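A rough sketch of what that combination looks like on a Liberty-era compute node - the CPU ranges, page counts, and flavor name below are placeholders for illustration, not our production values:

    # Hypervisor kernel command line (GRUB): keep the host kernel scheduler
    # off the instance CPUs and pre-allocate 1 GB huge pages at boot
    # (CPU range and page count are example values)
    GRUB_CMDLINE_LINUX="... isolcpus=4-55 hugepagesz=1G hugepages=700 ..."

    # /etc/nova/nova.conf on the compute node: only schedule guest vCPUs on
    # the isolated CPUs, and hold memory back for the hypervisor
    [DEFAULT]
    vcpu_pin_set = 4-55
    reserved_host_memory_mb = 8192

    # Flavor for the full-size search instances: one dedicated physical CPU
    # per virtual CPU, backed by the pre-allocated 1 GB pages
    openstack flavor set search-xl \
      --property hw:cpu_policy=dedicated \
      --property hw:mem_page_size=1GB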
Thank you, Chris. I agree, that particular tale was probably the least tragic of the lot that we're here to tell you about today. But I warn you, the next tale we are about to tell puts all the other tragedies to shame. And I will also say that, for the faint of heart, you may need to leave the room now, as we talk about what happens when we encounter recalcitrant storage. Our data drives our infrastructure and our software, and changes in the amount and the structure of the data require different configurations at deployment time. Our overall process here is: we deploy a bunch of hosts - those full-size compute hosts - we create a bunch of volumes, and we attach those volumes to those hosts. Whenever the data changes, we detach those volumes, destroy the volumes, destroy the hosts, and get a whole new cluster in its place. So I have some expectations when I deploy, right? As far as the storage goes, I expect to be able to attach these volumes in a consistent amount of time. In our deployment process for one cluster, we are doing 180 API calls to create volumes, 180 more to attach those volumes to compute hosts, then, when we're done with that data, 180 more API calls to detach, and then another 180 to destroy - and that's just for the volumes. And remember, this is just for one cluster; our applications now span multiple clusters, so you can take those numbers and multiply upward. So my expectations seem pretty reasonable, right? Well, we thought so too, until we tried it. What actually happened? The truly despicable world that we lived in with detaches. A volume attach request would be issued, and nothing would happen. We would issue a detach request, and it would go into a detaching state, and then nothing, forever. It was just never heard from again. So this is where I ask you to take a step back and appreciate what we're doing here. There is a great amount of data being replicated, a large number of instances being created, and a ton of work, and this takes hours to get through because of the complexity of the data. We're spawning almost two million processes across that cluster. So every time we failed at one of these attach or detach points, it was really ungraceful and very unforgiving. We would lose half a day, or an entire day, or more, before we could actually move forward. So what was really happening here? This one we were able to track back to three main components: the OS level, the OpenStack level, and the back-end storage vendor level. At the OS level, when Monica is trying to attach and detach, calls have to go through multipathd, because we're using iSCSI-backed devices from our enterprise-grade storage solution that you probably know extremely well - three-letter acronym. And multipathd was an issue there. os-brick was Liberty's first attempt to consolidate all of the attach and detach code scattered throughout the various projects into one shared library. I did mention that was the first attempt at doing this. And then, as I mentioned, our back-end storage API's response time and functionality were severely limited. What we were seeing was multipathd crashing - that is, while Monica was trying to attach or detach. If multipathd does not exist, you're not going to be able to attach or detach anything. This was happening because of the way we are attaching volumes: one Cinder volume has 16 paths within multipathd, and now replicate that nine times, because we're attaching nine volumes per instance. It's a lot of numbers, right? So every time we try to detach, we have to go through and unmap 16 different devices from the OS. Enter os-brick, the first attempt at standardizing attaching and detaching. The code path through os-brick itself was completely flawed. It was one-and-done: if it failed anywhere, it just completely aborted and the volume went to an error state. And what it was trying to do - flushing and rescanning every one of those 16 paths, multiplied by nine volumes - was causing multipathd itself to just segfault and die. We were able to get around that by compiling our own custom version of multipathd that pulled in some fixes from upstream. Those basically said: don't die.
I don't care if you don't finish cleaning up your work, just please stay around, so that when Monica issues her next detach or attach, you're there to do the work. That, plus moving it under Upstart, so that even if it does die, Upstart handles restarting it. Those got us into a semi-happy state: multipathd still wasn't cleaning up everything it needed to, but at least it was around to do some work. Then enter os-brick, take two. We've already mentioned that the path through the os-brick code was problematic at best. So we designed a custom code patch for CAS that says: first, please retry - don't be one-and-done. If you fail to find a path, or you fail to clean it up, please try one more time, and another time, and only then give up. We changed some of the flow so that we're not flushing and then rescanning on every path - do it once to find all the devices you have to handle, and then loop through those without flushing and rescanning every time. And the third thing we had to do, unfortunately, was to put an artificial pause in between device unmappings, to allow the OS to keep up with all the commands we're sending to it. The third component revolves around this back-end storage device you all probably know extremely well. Its API was probably the most problematic API I've ever seen. It can only handle requests serially. So think of it from a Cinder standpoint. Monica is saying: I want to detach all of these volumes, and I want to do it now. All of those become requests within a messaging system, and when those requests hit the messaging system, we start a timer. That's the RPC response timeout. If this API can only process one task at a time, and we've sent it 180 tasks to do, by the time we got down to number 20 or 21 or 22, the messages were dropped, right? The messages are no longer valid, because we've hit our response timeout. Enter volumes that are stuck in a perpetual detaching state. The API in the back end would still do its cleanup work, but we no longer have a message to map back to say this is now cleaned up; put this volume into an available, detached state. That one, unfortunately, we're still trying to work through. To help us - and this is something that, as an operator and an architect, I hate to do - we had to go back to the developers and say: please help us. The API can only handle so many requests at once, so please put some artificial throttling into your jclouds orchestration layer that says only detach one volume per instance, per compute, at a time. That got us somewhat closer. We also raised the RPC response timeout to five minutes, which is less than ideal from an alerting and monitoring standpoint. That got us about 90% of the way there, and now we're still fighting with the last 10%. So this one, unfortunately, is not solved. We've made some significant progress, but the issue is still around.
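To make those two band-aids concrete - a sketch only; the real throttling lives inside our jclouds orchestration code, and the volume and server names here are hypothetical:

    # /etc/cinder/cinder.conf on the control plane: give slow back-end
    # detaches a longer RPC reply window (the default is 60 seconds)
    [DEFAULT]
    rpc_response_timeout = 300

The orchestration-side throttle then amounts to serializing the detaches and waiting for Cinder to report each volume available before issuing the next request, something like:

    # Detach one volume at a time per instance, polling before moving on
    # (volume and server names are hypothetical)
    for vol in iap-shard-01 iap-shard-02 iap-shard-03; do
      openstack server remove volume search-01 "$vol"
      until [ "$(openstack volume show -f value -c status "$vol")" = "available" ]; do
        sleep 10
      done
    done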
So let's go ahead and summarize. In the network space, what was our problem? We had slow and unstable networks - they were falling from the sky. What did we do? We replaced VXLANs with VLANs, and that worked pretty well for us. We did get some of that throughput back, but we're still not quite where we need to be compared to our physical environment. Across a given application deployment, or...? We've been living in the VLAN space, so we're going to be underneath that. For a given stack we're about 100 VMs, and then we end up with about double that number of network interfaces, for one stack, for one environment. Multiply that by environments, and then other applications. So, yeah. In our compute space, what was the issue? We had stuck CPUs and crashing VMs - they would go into a shutoff state with no evidence of how they actually got there. So what did we do? We implemented CPU pinning and NUMA alignment, and we also reserved resources for the hypervisor itself. From that we got stability and performance, and we feel pretty good that the case was solved there. And in our block storage area, we unfortunately still have failures to attach and detach volumes at velocity. We're trying to do a lot of work, and we want to actually be successful and finish in a reasonable amount of time. So what did we do? Well, unfortunately, we implemented throttling, which, on our side, is a big band-aid. We did have some code patches, so there were some good things that came out of it, but this is still an ongoing issue for us. So, Scott, what's in our future? Hopefully less tragedy. But there are maybe some obvious things that we'd want to look at going forward. We noted that we still have some network performance issues, and we are a very network-intensive application. So taking a look at offering SR-IOV on these instances, and pinning the network like we've successfully pinned the CPUs and memory, would be something we'd like to look at. We'd also like to take a look at extending our memory and CPU pinning model to the rest of our cloud. It's been very successful in stabilizing our instances - no more oversubscribed CPUs and memory - and that's something we'd like to see elsewhere as well. We'd also like to take a look at different storage solutions altogether. I'll just go back and say: if you were at the keynotes this morning and you heard Boris talk about what happens when you take common enterprise solutions and apply them in the cloud space, I think we may exhibit that cautionary tale of what can happen when you do that sort of thing. So we'd try to be more cloudy and maybe look at some different instance storage options here. Obviously, you might be thinking to yourselves, why didn't they just use Ironic? Well, in the version of OpenStack we were on, it wasn't offered on a single control plane. That has maybe changed at this point, so we'd probably like to take another look at that, as well as containers. We are using Docker throughout a lot of other portions of our architecture, and bringing it into this space would probably be a good thing to look at as well. So with that said, we'll open it up for questions. Can you hear me? Yeah. Actually, one question: it seems like the application would be most suitable for something like Kubernetes, scale-out. On the networking side, have you considered using a non-overlay solution? Obviously, even if you're moving to VLANs, you still have OVS in the data path processing every packet. Have you considered something like Calico, where you have a pure layer 3 solution without overlays, and essentially there is nothing in the data path, so you would be getting bare-metal performance? So we're doing just straight L2 VLANs, so we don't really have anything interpreting from an OVS standpoint - it's all done by the kernel itself. The question is why? Why? So even in the kernel you have additional overhead, versus going with something like Calico, as an example, where you have nothing in the data path. That's an excellent question. I guess we haven't really dived into that. We haven't gone down that path, right?
So the problem we were trying to solve was an immediate one - we had one third of line-speed performance; how can we get better? And VLAN was just something that Rackspace provided as a reference architecture. We haven't actually looked at alternative solutions to VLAN itself. Yeah, I think those are definitely things that we could consider as we're looking to wring more out of our network as well. Thank you. Thank you. What kernel, and if you can say, what distro are you using that you've run into these NUMA scheduling issues with? It was Ubuntu 14.04 LTS. Okay, but are you using the 4.x kernel or the 3.13? It was the 3.18 kernel at the time. What is that again? The question is, which hypervisor are we using? Like KVM? Yeah, okay - I thought you meant physical gear. No, we use KVM. I had the same question, but for the multipath. So thank you. So that has been fixed upstream, by the way. They were able to pull in - it's not version five, it's the one right below it - the fix for the segfault. We just happened to run into it at the same time that the bugs were being filed. Yeah. So I've got a question concerning the memory reservation that you do, because we had some very similar issues and we also solved them in a very similar way. My first question is: how much memory do you actually reserve for the hypervisor? How much overhead do you have? We reserve eight gig. It's a little aggressive from our standpoint - we could probably get by with four - but by default it only reserved two gig. Okay, but eight gigs is per... so how big is the machine? What fraction is that? 768 gigs total. Okay, that's not too bad. Yeah, yeah, it's fairly low. We have eight gigabytes on 128-gigabyte machines, and I also feel that this is a little bit conservative. Yeah. It's much more conservative. The second question is about Cinder, or volume detachments, because what we observe is that when we detach a large number of volumes - for instance from, I don't know, Kubernetes clusters - where suddenly there's a bunch of volume deletions, we see races in the quota and reservation commits, which then lead to the timeouts that you described. Have you seen this as well? We've not seen anything from a scheduling standpoint. We've been able to track this fairly certainly back to the slow API response time we're getting from our back-end storage device. I think you may be foreshadowing our next series of tragedies, as soon as we get past that particular bottleneck. Thanks. Over here? Just two quick questions. What ML2 plugin are you guys running? And are you still on Liberty, or are you moving forward? We're still on Liberty. We have a Newton dev cloud where we're deploying right now to test all the changes, and I'm not sure of the ML2 plugin offhand. James? Thanks, James. James is my network guy - he's my fantastic go-to guy in the back there. Hi. Networking questions. So your network is layer 2, right? My question is which protocol the network is running. Is it TRILL or, you know, something so that it's not spanning tree, right? You mentioned that you're only 66% utilized. I wonder what the network protocol is that's driving the layer 2. So, the benchmarks that we quoted - the question was what protocol we are driving right now. The application itself just uses TCP sockets, with some fairly large packet sizes as well. But the benchmark that we quoted was netperf TCP_RR - TCP request/response - which we used just to get a baseline for comparison. Does that answer your question? Maybe not.
I'm not sure. Right. It's really the protocol that gives you the data path in the network. Is it spanning tree, or multiple spanning tree, or TRILL, or maybe something like that? So, yeah, I think you're asking something that dives more into our physical network, and unfortunately we're not well represented to answer that. We are using a networking technology that gives us a flat latency - we're not using a top-of-rack switch architecture - and it gives us a fairly flat latency through our data center. So that hasn't been the source of the performance problems; it's been more the on-host architecture where we feel the bottlenecks are coming in. Okay. Thank you. Thanks for a good session. For your compute nodes, did you consider using Nova-LXD as an alternative to KVM? So, yes, but that is more of the next step. Like, can we use containers? Can we use Ironic? Can we do something beyond KVM? Okay. Absolutely. Yeah, because I think you can definitely get a performance improvement without the hypervisor overhead, and the nice thing about that versus bare metal is that you get so much more flexibility, because, you know... Well, that was one of the things when we started looking at this. There were so many deficiencies with Ironic itself that it's like we've built a poor man's Ironic under Nova. Yeah. The shared networks, the multi-tenant networks, the single control plane - all the things that we needed to make this happen weren't available yet. Okay. Good luck. Yeah. Thank you. Thank you. When we introduced huge pages and CPU pinning, we did this mostly, or only, basically for performance reasons, because we saw a clear difference between what we get out of the virtual machine compared to what we get out of the physical machine. Did you compare your virtual machines with physical hardware, performance-wise? What's the difference, or what's the loss, the overhead, that you have with your virtual machines now that they run with huge pages and CPU pinning? Yeah. So regarding the performance change with CPU pinning and huge pages: honestly, given the scale of our application, we didn't have to benchmark it. It was either it works or it didn't. We run our CPUs extremely hot when a customer request is in, and that was causing, as we mentioned, the CPU lockups, to the point where it just wasn't making progress. And because we employ data-parallel algorithms, if one of those locks up, it basically slows down the entire request. And then from the memory standpoint, the situation was that, because we weren't being NUMA-aware in our scheduling, it led to host-level imbalances. You could look at the host and say, hey, this thing has free memory, and then you look in the syslog and see the OOM killer killed your instance, and you're like, what the heck? Well, it kicks in even when just one NUMA region fills up. Even if the host has room, if one NUMA region fills up, then that's, yeah. So back to your question: the baseline was really just, is it even working for us? Because we run so hot, memory and CPU oversubscription is likely not going to be very good for us in general. So we've sort of taken it on that. Last question. Yeah. Sorry. We didn't dare to run with one-gigabyte huge pages. So the benchmark that I've done is: we moved to two megabytes.
And one gigabyte was even better, but it felt very large. Have you experienced any issues with these large huge pages, or does it just work, the way two megabytes or 4K would? Because we would like to - I mean, the further we reduce the page table translation overhead, the better for our application. This is why we went to larger huge pages. But one gigabyte felt very big. So, yeah, regarding the page sizes, just to be clear: we pre-allocate the huge pages at the host level, and within the guest we use those huge pages. Yeah, so do we. Okay, so, right. We haven't experimented too much in terms of that. But if we look at our broader cloud, this is a strategy that we're using currently just on these very large compute instances. We may extend it to the other ones, but pretty much every VM we have is an increment of a gigabyte - we don't have anything smaller. So fragmentation or other disadvantages with the gigabyte huge pages - we don't see a whole lot of disadvantages with that. Plus, the one-gig huge pages better match their data model: they have a lot of data they want to load into RAM. We did see TLB cache misses go down when we went to one-gig huge pages. I mean, of course, right? Because you have a bunch more RAM in there. Thanks a lot. Thank you. I have a question. How many instances were you starting and destroying, like, per day? So this, Monica, that would be from your testing standpoint. When we were going through the initial testing, finding all of these problems, it was: stand up a whole cluster and then tear it down, multiple times a day. Right. So, a cluster - we did experiment with different sizes, but for the most part it was 20 instances, so 20 hosts, and then we would have eight or nine volumes attached per host, so there you get the 180 volumes. But as far as the instances, yeah, we would try to create them, stand them up, and tear them down, and we would make a dozen attempts in a day, in under eight hours. And we had to manually intervene to actually correct things, right, to keep it going, or to call Rackspace and say, hey, I've got a bunch of volumes in an error state, can you help me out here? Again. And again. Yes, it was constant, and we really could not move forward. So we had just constant all-hands-on-deck trying to get through resolving these issues. And for us, this was a very popular product, and it was a big deal, and everyone was really watching it, right? We had due dates, and we had to make them, and it was very well known across the entire company. So to continue to have these failures was really evident to our internal customers, but ultimately to our external customers as well - we did have beta users out there testing it, and it would impact them, whether we were trying to deliver some kind of update or just to keep the application online. Yeah, these issues did a lot to erode confidence in OpenStack. It was the instability, the unreliability of it, and we've gotten a lot better at providing a stable platform. We just need to go that last, you know, last half a mile. So maybe just jumping back to your question: how often are we trying to do it, and how many instances? Right, how many instances per day.
So we have a cluster size - right now it's replicated to two clusters - but then, with the design being based on the Lambda architecture, where we're essentially replacing that serving layer, we'd like to do this on a daily basis. We are nowhere near that, primarily because of all the attach and detach issues that we've seen along the way. We're still making progress on that. So the end goal would be around 30 instances a day being spun up and spun down, but during testing it was hundreds per day. Okay, and so you did not face issues with Nova's scheduler, right? No. No, okay. And as far as I understood, you faced issues with the Cinder volumes, right? Correct. And the slow messaging bus, right? Yeah, it was the API of our enterprise storage solution that was not up to snuff, basically. It couldn't handle the work we were trying to throw at it. Oh, so the bottleneck was the API of the storage back end? Yes, exactly. Okay, so not the messaging bus? No. That was yet another victim. Yeah, the messaging bus was a victim, because if we said detach these 20 volumes, we start a timer, and if it could only get to 15 before the timer runs out, we have five volumes stuck in perpetual detaching. Thanks for the cool story. Thank you, guys. Thank you. So thank you all so much for coming today. Cool, that's it. All right. Cool. Nice job.