Welcome, everyone. I'm Brett Milford, and today I will be talking about optimising resource allocation in hyperconverged infrastructure. I'm going to provide an overview of the main categories of infrastructure architecture and some key considerations you should think about when employing a hyperconverged architecture. Then we'll take a look at a few case studies that exemplify these considerations, and to round out the talk I'll briefly cover some tools and methods for interrogating these issues, along with an overview of the various controls at your disposal to manage the allocation of resources in your systems. This will by no means be an exhaustive look at the issue; in fact, I would have liked to cover a few more topics in depth. However, this talk should provide a good cross-section of the topic, and hopefully you can apply these ideas to other areas where you encounter these issues.

A little bit about me: I'm a software engineer at Canonical within the Sustaining Engineering team. Sustaining Engineering provides L3 break-fix support to Canonical's customers, and as such we maintain a team with broad and deep expertise across all the software we support, so that's OpenStack, server applications like MySQL, and right down to the Linux kernel. Between this role and previous roles I've been in, I've been an enthusiastic user of OpenStack since 2016. In fact, in preparing this talk I was going back through my archive of old notes from when I was an operator, and I found an upgrade plan for OpenStack Mitaka from early 2017, so that was a fun trip.

As I mentioned, Canonical supports numerous OpenStack clouds at scale, with Ubuntu being the foremost distribution underpinning OpenStack deployments. Sustaining Engineering provides L3 break-fix support, and our unique position in the organisation as a catch-all affords us the opportunity to dive deep into really interesting problems and topics.

Managing the breadth of applications required for a cloud, and their resources, is a complicated matter at the best of times, but it's particularly difficult in hyperconverged architectures. Applications competing for resources can create compounding effects on resource starvation; alternatively, due to the vast interplay of applications and configurations, and despite available resources, you can encounter edge cases where things simply don't work. This is the area I'll be focusing on in this afternoon's presentation.

So I'm going to give you a brief overview of the types of infrastructure architecture and, roughly, how they relate to charmed OpenStack. Firstly, we have the disaggregated architecture. This is where all your components, like compute, network, storage and the control plane, are hosted on separate nodes. This allows the logical separation of applications by workload and hardware requirements, and it also allows the separation of user and system workloads. It also cleanly accommodates mixing hardware and software components, for instance having dedicated network or storage appliances in the mix.

Next we have the converged architecture. Here, one type of node hosts the control plane components, while another type of node hosts the storage and compute workloads. This still permits separating system and user workloads, but it combines some of the functions and requirements from the previous architecture. Then we have the hyperconverged architecture.
This is where all the functions, compute, network, storage and the control plane, are distributed across all the nodes of the cloud. These applications might be logically separated with containerisation; for instance, in charmed OpenStack we make heavy use of LXD containers. This is typically a sought-after choice for general-purpose workloads, in order to maximise the flexibility and utilisation of cloud infrastructure. These three architectures generally serve as a talking point for design and implementation; in reality, mixed hardware and software components, and converged and disaggregated nodes, are common in many clouds. And in the near future there might even be a fourth architecture that emerges as we see the proliferation of DPUs, for instance.

So when you're analysing hyperconverged infrastructure, there are a few things you should think about. It's unique because you need to be aware of many software components simultaneously, their characteristics and their interactions. Some key facts that are good to gather when you're working with hyperconverged infrastructure are the per-host application layer, including the kernel, the OS, and the key system functions or applications running on a given node. It's worth noting that there might be a number of combinations in a hyperconverged setup: you might have one node that hosts nova-compute, a ceph-mon and MySQL, and another node that hosts nova-compute, a ceph-osd and RabbitMQ. It's useful to be aware of this, and in some instances it's helpful to generate a service map (a rough sketch of the idea follows below). This map can be useful in narrowing down issues, ruling out specific services or layers, or focusing in on the interaction between two particular applications. For instance, if you have an issue and it's reproducible across a range of nodes, or across a subset of nodes with different service maps, you can probably use that to help rule out certain applications and interactions, or to focus on the interactions between one or two applications.

Some other key considerations: you might need to make common-denominator configuration choices when you're working with hyperconverged infrastructure, unless your cluster is set up to manage the complexity of a heterogeneous cluster. And as a matter of planning, you should generally separate user workloads and system workloads and their concerns when you're modelling your requirements. I would generally start by analysing the control plane, network and storage requirements first, before adding user workload concerns into the problem. A couple more things that are always useful to think about when you're looking at an application and its requirements: how does it scale, does it scale with user workload, and what's the relationship there?
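To make that concrete, here's a minimal sketch of what such a service map might look like; the host names and the exact service layout are hypothetical, not taken from any real deployment.

    # Hypothetical service map for a small hyperconverged cluster: which
    # system services are co-located on which host (names are made up).
    SERVICE_MAP = {
        "node-1": {"nova-compute", "ceph-mon", "mysql"},
        "node-2": {"nova-compute", "ceph-osd", "rabbitmq-server"},
        "node-3": {"nova-compute", "ceph-osd", "mysql"},
    }

    def hosts_running(*services):
        """Return hosts that co-locate all of the given services."""
        wanted = set(services)
        return [host for host, running in SERVICE_MAP.items() if wanted <= running]

    # e.g. narrow an issue down to the interaction of two particular applications:
    print(hosts_running("nova-compute", "ceph-osd"))   # ['node-2', 'node-3']

Even something this simple makes it quick to check whether a fault follows a particular pairing of services or shows up regardless of the service map.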
So we're now going to take a look at some case studies to help elucidate the methodology I've outlined. The first case study is a Ceph FileStore issue that's not super uncommon; in fact, if you have users of Ceph FileStore, you've probably already encountered it in the past. The user presented with hangs and kernel panics on their systems, and a lot of dmesg messages with kernel hung tasks, which look like "INFO: task ...: blocked for more than 120 seconds". The software stack they were using was Ubuntu 14.04 with a 4.4 kernel and Ceph Jewel. The output of their free memory looked like this: as you can see, there's sufficient free memory, but the excess memory has been used for cache.

So what is the application behaviour of Ceph FileStore? Well, the OSDs are implemented on top of a common file system, in this case XFS, and the OSDs make use of the page cache for buffered reads and writes. Dirty pages are flushed frequently to disk, clean pages fill the page cache, and they might linger if they're never invalidated. This is demonstrated in the output of /proc/zoneinfo, or /proc/meminfo rather.

So what's the problem here? Well, reclaim needs to take place to be able to use the memory that's currently being used for cache, and sometimes direct reclaim needs to happen at inconvenient times to satisfy a memory allocation request. We can't directly set the point at which kswapd wakes up and performs asynchronous reclaim, but we can see where it will. That slide was /proc/meminfo, but if we look at /proc/zoneinfo we have these watermark values. When free pages drop below the low watermark, kswapd wakes up and asynchronous reclaim takes place until at least the high watermark's worth of pages is free. And when the value drops below the min watermark while an allocation is being requested, direct reclaim takes place and the allocation stalls until enough memory is free to satisfy it.

So what can we do about this? Well, on Linux 4.4, not a lot. The watermark values on the previous slide are actually set as multiples of min_free_kbytes, so we could raise that value and then kswapd would wake up earlier to reclaim the cached data. The second option is to manually drop and compact memory, and that's what we did in this circumstance (a minimal sketch of that follows below).
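As a rough illustration of that manual workaround, here's a minimal sketch of dropping clean caches and then compacting memory via the standard procfs knobs; it needs root, and how often and how aggressively you do this is very workload-dependent.

    # Minimal sketch: force the kernel to drop clean page/slab caches and then
    # compact memory, using the standard procfs interfaces (requires root).
    import os

    def write_proc(path, value):
        with open(path, "w") as f:
            f.write(value)

    os.sync()                                        # flush dirty data first
    write_proc("/proc/sys/vm/drop_caches", "3")      # 3 = page cache + reclaimable slab
    write_proc("/proc/sys/vm/compact_memory", "1")   # rebuild higher-order free pages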
Our second case study is a Swift-on-XFS issue. In this circumstance the user presented with the following: reads and writes to their Swift cluster were failing at a high rate with 503 Service Unavailable. It wasn't a complete outage, but it was a severe degradation of service. The software stack was Ubuntu 14.04, a 4.4 kernel, XFS and Swift Mitaka.

In investigating this issue, we found large container databases. The container database replicator was producing a high number of quarantined files, and the replication of these container databases was failing in a number of ways. One of them included lock timeouts when trying to replicate databases, and there were also various database errors leading to quarantined database files, similar to the ones listed there. We also witnessed a couple of XFS-related errors.

So how does Swift container database replication take place? Under the hood, Swift will sync batches of rows of the container databases when the difference between the databases is small; when the difference between databases is large, it'll rsync the entire database and basically drop it in place. These databases were quite large, 25 GB each, and there were many, many copies of them spread throughout the entire Swift cluster. So Swift wanted to try and replicate them piece by piece, basically, and under these conditions it was simply unable to complete.

In additional analysis of this infrastructure, we noted there were high levels of memory fragmentation, as demonstrated by a lack of higher-order pages. Analysing the memory usage further, there was around 11.8 GB of anonymous memory usage and 32.3 GB of file memory usage, and the total reclaimable memory was about 52 GB. So reclaim should have been possible, but it clearly wasn't happening soon enough. In analysing /proc/slabinfo, the major contributors to usage were xfs_inode at 39 GB and dentry at 4 GB (a sketch of pulling those numbers out of slabinfo follows below). This indicated two paths: the first was that we could increase vfs_cache_pressure to preference dropping dentries and inodes when reclaim takes place, and the second was to force the reclaim of that memory. So that's basically what we implemented in this scenario, quite similar to the previous one.

Further investigation of this issue led to the identification of the root cause, where Swift was unintentionally triggering an XFS anti-pattern: pending objects were written temporarily to a file in a tmp directory and then renamed to move them into the Swift directory hierarchy. This led to a disproportionate number of inodes in a single XFS allocation group, and that led to really poor performance. This is fixed in a later version of Swift; however, that was unavailable in this environment, so these kernel tunables were essential in bringing the environment back to a working condition in the meantime.
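For reference, here's a rough sketch of ranking slab caches by approximate memory use from /proc/slabinfo; reading slabinfo generally needs root, and the sizing below (objects multiplied by object size) is only an approximation of the real footprint.

    # Rank slab caches by approximate memory usage from /proc/slabinfo
    # (columns after the two header lines: name, active_objs, num_objs, objsize, ...).
    def top_slab_caches(n=10):
        usage = []
        with open("/proc/slabinfo") as f:
            for line in f.readlines()[2:]:          # skip the two header lines
                fields = line.split()
                name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
                usage.append((num_objs * objsize, name))
        return sorted(usage, reverse=True)[:n]

    for size, name in top_slab_caches():
        print(f"{name:30s} {size / 1024**2:10.1f} MiB")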
The third case study was a recurring issue with a hyperconverged node which was also using huge pages. The user presented with the following issue: creating OpenStack instances backed by huge pages would frequently fail, yet checking free memory and /proc/meminfo, it seemed like there were ample resources available for those instances to be deployed onto that infrastructure.

Having a look at the software stack, they had a 4.15 kernel, and the nodes were obviously running nova-compute. Interestingly, in this scenario we were able to reproduce the problem on many nodes across the cluster, which in some ways implied it wasn't really specific to the services running on those particular nodes. Initial investigation showed that instances were failing with the following error.

So what's the behaviour of QEMU? Basically, QEMU starts a VM process; it pre-allocates the memory for the instance, but it also allocates some memory for its own execution. The system had ample huge pages, which it was using to back these instances, but it had a shortage of higher-order pages for QEMU to claim for its own use. In the hyperconverged architecture we allocate a portion of memory that's reserved for system use. This needs to account for various applications with various usage patterns, including this QEMU allocation, which comes out of the system portion but is required by, and scales with, the user workload.

So our initial solution had two paths. We could either increase the reserved portion of memory, which in this circumstance was difficult because we were already reserving about 20 GB of memory, or we could reduce the utilisation of the reserved portion. Drilling down with the usual tools, there were no particular processes using that much memory; the largest consumer at the time was OVS, and it was less than a gigabyte. There was, however, a large number of nova-api-metadata processes, 161 to be precise, and together they accounted for roughly 14 GB of memory.

Now, the metadata service provides a way for instances to retrieve instance-specific data by responding to their requests; the nova-api-metadata application serves the metadata API and routes those requests. It's needed any time you need metadata, and a common time you need metadata is when you're trying to boot an instance. So in a highly dynamic cloud you might need a lot of nova-api-metadata workers to service those requests, but in a relatively static cloud that's not needed so much. Either way, in this case 161 was probably overkill, so we tuned this to a more sane value via Juju configuration, and we also noted that the charm's adaptive configuration could be tweaked for this scenario to account for the infrastructure it's being deployed onto.

Tuning the number of metadata workers freed up a significant amount of resources across the cluster, and that solved the problem for the customer. However, the problem returned a few months later, and this time there were no significant resource consumers. Investigating /proc/zoneinfo yielded some inconsistencies in the calculation of the min, low and high watermarks, those watermarks we were referring to earlier. The gap between the min, low and high watermarks is calculated as the maximum of two terms: the min free pages multiplied by the zone's managed pages and divided by the sum of managed pages across all zones, or the zone's managed pages multiplied by watermark_scale_factor and divided by 10,000 (a simplified sketch of this arithmetic follows below). On this node, which as I mentioned was allocating huge pages at boot, /proc/zoneinfo showed the following, and it was evident that if you took the default value of watermark_scale_factor and echoed it back into /proc/sys/vm, these values would change.
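Here's a simplified sketch of that arithmetic. It folds everything into a single zone and ignores some per-kernel-version details (for example the quarter-of-min term some kernels use for the gap), so treat it as an approximation rather than the kernel's exact calculation; the node size and min_free_kbytes value are hypothetical.

    # Approximate the low/high watermark gap for a zone, following the two
    # terms described above (simplified; real kernels add further details).
    PAGE_SIZE = 4096

    def watermark_gap_pages(min_free_kbytes, zone_managed, total_managed,
                            watermark_scale_factor):
        min_free_pages = min_free_kbytes * 1024 // PAGE_SIZE
        zone_min = min_free_pages * zone_managed // total_managed   # zone's share of min
        scaled = zone_managed * watermark_scale_factor // 10000     # scale-factor term
        return max(zone_min, scaled)

    # Hypothetical single-zone 256 GiB node with default-ish settings:
    managed = 256 * 1024**3 // PAGE_SIZE
    print(watermark_gap_pages(65536, managed, managed, 10), "pages")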
So how are these watermarks actually calculated, or more to the point, what are the logistics of that calculation? The first step is that when the system boots, huge pages are allocated, and they're allocated at boot by the boot memory allocator; when I say allocated, they're essentially reserved. At the end of the boot stage, the boot memory allocator transfers the remaining memory, so not the reserved huge pages, to the buddy allocator and populates the zone's managed pages. The watermarks are then calculated whilst the huge pages are still reserved by the boot memory allocator. Then the huge pages are returned from the boot memory allocator to the huge page free list, which adds to the zone's managed pages. Because of this, the values the min, low and high watermarks would take if recalculated at runtime are much larger than the values that were calculated while the system was booting.

So ideally, at least in this case, the watermark calculation should be based on the memory excluding the huge pages, which cannot be used by the buddy allocator anyway. However, this would have implications for transparent huge pages and the values reported by free, so that would need to be addressed separately. In this case study in particular, despite the mismatch, there was still evidence that compaction thresholds were not being reached before we saw page allocation failures. To address this we implemented the following tunings: we raised watermark_scale_factor, and we lowered extfrag_threshold, which reduces the fragmentation index at which compaction is triggered.

So how do we observe these issues? Well, we have two primary categories of tools. The first is tracing, which is essentially making use of event-based records: tracing tools capture data points based on the execution of an event, and application logs are a really classic example of this. There's also dynamic tracing, which involves tracing an arbitrary function in a running system, but that can be quite finicky to implement. The second category is sampling, where you capture a set of data points at a given point in time as a snapshot; you can then use multiple snapshots to build a profile of data points over time.

So let's take a look at some specific tools. We have ps, which provides a snapshot of the process information on a system. It's a sample of one, but it can still be highly useful. With some shell techniques, which were highly useful in the case studies I showed, you can filter and sort the output, get a rough idea about which processes are using what on your system, add up the usage of certain processes, or even count the number of different processes (a rough sketch of that kind of aggregation follows below). The only problem with ps is that short-lived processes are particularly hard to capture. That's why you need tools like execsnoop, which can capture information about these short-lived processes by tracing process execution rather than sampling it.
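As a sketch of that kind of filtering and summing, here's roughly how you might total resident memory per command name straight from /proc rather than by parsing ps output; this is handy when the real consumer is something like 161 identical worker processes rather than one big one.

    # Sum resident memory (VmRSS) per command name across all processes,
    # reading /proc/<pid>/status directly.
    import os
    from collections import Counter, defaultdict

    rss_kb, count = defaultdict(int), Counter()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
            name = fields["Name"].strip()
            rss_kb[name] += int(fields.get("VmRSS", "0 kB").split()[0])
            count[name] += 1
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue    # process exited or is inaccessible; skip it

    for name, kb in sorted(rss_kb.items(), key=lambda x: -x[1])[:10]:
        print(f"{name:20s} {count[name]:4d} procs {kb / 1024:10.1f} MiB")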
Next we have sysstat, which is a sample-based tool, and it can provide a profile when daemonised or run at intervals. For instance, we could view the memory statistics on my home server for yesterday. sar provides a really good starting point for locating issues, but its granularity is generally too coarse to actually capture acute problems. For those you'd go to a tool that's a bit more in-depth, called perf, which is quite useful.

perf is a sample-based tool as well, and it can generate a lot of data. It's not suitable to run constantly due to the performance overheads, and it would leave you with a significant amount of data to sort through when you're done, which may or may not relate to your problem. If an issue is intermittent, the operator needs to be on the lookout for the problem behaviour and be ready to capture the issue with perf. To assist with this, there is a simple tool to help capture perf data when problems might arise; it uses simple heuristics, like a spike in CPU, to trigger perf to start recording. perf data can also be used to produce flame graphs, like the one in the background of the slide.

Finally, we have BPF, which can refer to a number of things. Originally it was the Berkeley Packet Filter implementation, but in modern times it usually refers to the extended implementation (eBPF) in the Linux kernel, and people can also sometimes be referring to the suite of compilers, libraries and tools that make use of BPF, including the BPF Compiler Collection (BCC) and bpftrace. Making use of BPF requires a relatively modern kernel, at least 4.9, and the newer the better; in all but one of the case studies I showed, we didn't actually have that available to us, and for operators this should definitely drive the business case to upgrade as soon as possible. BPF has both sampling and tracing capabilities, and BCC and bpftrace provide a bunch of generally useful tools; I mentioned execsnoop before, which is a BPF tool that's part of the BCC collection. The ecosystem around BPF tooling is also quite advanced, and it's actually quite straightforward to develop your own tools. For our use case in Sustaining Engineering, we think a combination of Go eBPF libraries with BTF (the BPF Type Format) looks like a really promising way to develop tools for specific scenarios, with minimal administrative overhead compared to what you usually get when hand-rolling kprobes and similar things.
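To give a flavour of how low the barrier is, here's a minimal BCC-style sketch (assuming the bcc Python bindings are installed and you're root on a reasonably modern kernel) that prints a line every time a process calls execve, essentially a toy version of what execsnoop does.

    # Toy BCC example: trace execve() calls system-wide (needs root + bcc).
    from bcc import BPF

    program = r"""
    int trace_exec(struct pt_regs *ctx) {
        bpf_trace_printk("execve by pid %d\n", bpf_get_current_pid_tgid() >> 32);
        return 0;
    }
    """

    b = BPF(text=program)
    b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
    b.trace_print()   # stream the kernel trace pipe to stdout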
Now I'm going to have a look at some mechanisms of resource confinement and their management interfaces, starting at the top with application-level controls. We had a bit of a look at Nova; there's probably more than this, to be honest, but two particular controls or configurables stand out. The first is of course the reserved host memory (reserved_host_memory_mb), which signals to Nova to reserve some portion of memory when making scheduling decisions and is essential to maintaining some kind of partition between user workloads and system workloads. The second is the metadata workers, which, as we saw, can be quite a drag on resources if they're over-committed; they should be set appropriately for the environment. This also serves as a timely reminder that scale-related configs can be quite essential, and you should probably check that they're appropriate for your environment.

If you're using Ceph with BlueStore, the OSD memory target (osd_memory_target) is a really good setting to consider. It's a best-effort setting, but it's still helpful in managing co-located services; basically, it sets a target that the OSD processes will try not to exceed. And if you're using MySQL, innodb_buffer_pool_size is an interesting tunable. It's actually tuned quite conservatively by default, so it's unlikely to cause you any issues out of the box, but if you are increasing that value, say to get some more performance out of the system, you also need to consider the impact it will have on the reserved portion of memory and other parts of your system.

At the next level down you'd be looking at namespaces and cgroups. A namespace is an abstraction that makes a process appear to have its own isolated instance of a resource. Some frequently encountered namespaces are the user namespace, which maps UID and GID values, the PID namespace and the network namespace. Noticeably absent is a storage or device-mapper namespace, which means that controlling the device mapper from inside a container can be quite tedious and fraught with danger. You can run an application in a namespace with nsenter, and you can list namespaces with lsns.

The next configurable you have is cgroups. They're essentially a method to organise processes and distribute system resources in a controlled, configurable manner. Processes are organised into a tree structure, and various controllers act on parts of that tree to impose limits, or do whatever you want. In Ubuntu we mount cgroups at /sys/fs/cgroup, and you can interact with those cgroups along the lines of the example I've put up here.
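Since that slide isn't reproduced here, a rough equivalent sketch against the cgroup v2 interface might look like the following; it assumes a unified hierarchy mounted at /sys/fs/cgroup with the memory controller enabled for child groups, root privileges, and a hypothetical cgroup name.

    # Rough sketch: cap a process's memory with a cgroup v2 memory controller
    # (assumes a unified hierarchy at /sys/fs/cgroup and root privileges).
    import os

    cg = "/sys/fs/cgroup/demo-limit"          # hypothetical cgroup name
    os.makedirs(cg, exist_ok=True)

    with open(f"{cg}/memory.max", "w") as f:  # hard memory limit for the group
        f.write(str(2 * 1024**3))             # 2 GiB

    pid = os.getpid()                         # or any PID you want to confine
    with open(f"{cg}/cgroup.procs", "w") as f:
        f.write(str(pid))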
It's likely that you're already using namespaces and cgroups via another management interface, perhaps systemd, perhaps something else. systemd has its own tooling and terminology, but it essentially maps to the same thing: in systemd, cgroups are modelled with slice units, which are much the same concept for hierarchically managing the resources of a group of processes. systemd also has tools for managing cgroups, or certainly introspecting them, systemd-cgls and systemd-cgtop, and it contains a tool, systemd-nspawn, which allows you to spawn processes or OS trees in a namespace. So systemd-nspawn, coupled with systemd slices, provides the tools to manage and inspect namespaces and cgroups on its own. You may also be aware of LXC and LXD: LXC is a userspace interface to the kernel containment features, and LXD is a container and virtual machine manager. You can create containers with lxc-create or lxc launch, and it'll set up the namespaces and cgroups for you for the resulting container.

Finally, when we take another step down, we have global kernel tunables. They're used to alter the behaviour of the kernel at runtime and are controlled via /proc/sys. As the case studies primarily focused on memory problems, we'll look at tunables in that area; the relevant tunables reside in /proc/sys/vm, where vm refers to the virtual memory abstraction that the kernel presents to processes. We've made heavy use of two tunables today. The first was min_free_kbytes, which is used to force the Linux VM to keep a minimum number of kilobytes free; this number is also used to compute the watermark values for each zone in the system, proportionally. This is useful, as we can tune asynchronous reclaim to take place sooner, but the trade-offs are that we'll likely have at least some unused memory and there will be more frequent asynchronous reclaims. The second was watermark_scale_factor, which defines the amount of memory left in the system before kswapd is woken up, and how much memory needs to be free before kswapd goes back to sleep. This is also useful for tuning asynchronous reclaim to take place sooner, with trade-offs similar to min_free_kbytes but with less wasted memory. It's only available in newer kernels, upstream since 4.6, so it couldn't be used in many of our case study environments.

So, we've covered the conceptual differences in infrastructure architecture and the key considerations for analysing hyperconverged architecture. We've had a look at several workload case studies, we've covered the use of basic tracing and profiling tools to improve observability, and we've also covered the mechanisms available at various levels to manage the behaviour of applications within our hyperconverged architecture. As an operator, there can be a lot of work in conceiving an idea of a cloud and delivering a fully functional and efficient service. As demonstrated, and this is just the tip of the iceberg, there are many paths to traverse when locating, diagnosing and fixing issues, and there are many configuration choices to manage, ranging across the entire application stack. Great care needs to be taken to ensure these applications work efficiently alongside each other and user workloads. The OpenStack charms package up the operational knowledge required to achieve this, which in turn alleviates much of the operator burden and complexity. They also provide a unique opportunity to optimise configuration for the target environment. So when we were investigating these issues and considering the numerous ways we could improve these problems for a broad user base, we decided to introduce a spec and implementation to the OpenStack charms to configure watermark_scale_factor appropriately for the target infrastructure. In the future we aim to take more of these insights, develop them and push them back into the charm operator ecosystem, and hopefully this should improve the performance and reliability of these systems for all our users.

Before I finish, I should give a brief shout-out to Gavin Guo, a senior software engineer with Sustaining Engineering, who informed much of the kernel side of this talk. Also, if the slides are made available, I've got a bunch of links at the end to helpful resources that are really good for profiling and looking at these sorts of things. Cool, thank you. I don't know if we have time for questions... oh, it looks like we won't have any questions. Perfect. All right, thanks.