Welcome, everybody, to the talk "Optimizing Software Defined Storage for the Age of Flash". The presenters are Krutika, Manoj and Raghavendra. Give them a big hand.

Thank you very much. So we are the last talk, keeping you from your dinner I guess; that's always an enviable position to be in. The talk is about optimizing SDS for the age of flash. Basically, a lot of fast SSD devices are coming into general use. We work at Red Hat on the software defined storage system called Gluster (it's there at the bottom of the slide), and the talk is about optimizing the software stack to get the most out of the performance these hardware devices are capable of.

Since there are multiple speakers, I'll give you a brief overview of what to expect. I'll start off with an introduction, and as the performance engineer I will present a performance-centric view of the problem we are trying to solve; for this part you don't really need a whole lot of background about Gluster, its architecture, and so on. After that, Krutika will cover the Gluster architecture, the flow of a request through the stack on the client and the server side, and start describing some of the enhancements we had to make to improve Gluster's performance for this particular use case. Raghavendra will continue with a focus on the RPC layer improvements we had to drive to get better performance out of Gluster, and we will conclude with some lessons learned and some of the work that is in progress in this area.

Gluster has been around for a while and many of you probably know about it. Its traditional strength has been use cases that are large-file, sequential-I/O oriented: storage for CCTV, crash-test videos, IPTV, backup, and so on. Typically you would back that storage with spinning drives, which have good cost-per-gigabyte characteristics, so you can store a lot of data and the performance is good for these kinds of workloads. That has been the strong point. But there are some trends coming up. On the device side, SSDs are becoming commonplace in data center deployments, and some of these, especially NVMe SSDs, can be really fast in terms of the IOPS they can deliver. This is one area where hard disks have typically not been good: for a small random I/O workload, the seek overhead keeps spinning drives from delivering any kind of good performance, and that is exactly the area where these flash devices excel. In terms of IOPS capability, some of these NVMe SSDs can do 300,000 IOPS or more, whereas a typical hard drive would be something around 200; not 200K, just 200, versus 300K or more.

The other thing is that Gluster, as some of you may know, is integrating with container platforms like Kubernetes, and there the role of the solution changes a little bit. It shifts from being a scale-out solution targeted at a particular use case, say somebody has an IPTV workload they want to back with Gluster, towards providing a storage infrastructure layer for a larger cluster. There you are not targeting any specific workload:
whatever workload is running in the cluster, you have to support any and all of them; you cannot be picky about what you do and don't support. So it's important for Gluster in this environment to be able to support a larger variety of workloads, and some of them happen to be databases and other IOPS-centric workloads that Gluster previously did not have to worry about too much, because people were not so interested in putting those workloads on top of Gluster. Once we started looking at this combination, how Gluster behaves for these kinds of workloads on these kinds of devices, we did notice some deficiencies, and the path we took to optimize the solution is what this talk is really about. This is very much a work in progress, but around the time the FOSDEM submission was due we figured we had enough material that it would be a good time to go out and tell people about some of our experiences and observations, and maybe get some feedback from you as well.

Just for the sake of comparison, what we're doing here is taking a basic local file system, Linux XFS, backed by a local NVMe SSD drive, and running a small random I/O test on it. What you see is this: on the x-axis we are varying the number of concurrent requests, the number of concurrent jobs issuing I/O requests. When that number is low, if you just have a single-threaded application doing small I/Os, there is a limit to what you can get; it's basically dictated by the latency of the stack, and even a local file system will not be able to push something like an NVMe SSD to its limit. But as you increase that number, as you increase the I/O depth (I'll use the term interchangeably with the number of concurrent requests), you will see the IOPS steadily increase up to the point where the device saturates, and at the peak you can get something like 300K-plus IOPS. Notice that the y-axis here is IOPS in thousands, so that is 300,000 in this case. This is what we would expect a good solution to deliver. Obviously we don't expect Gluster, as a scale-out distributed solution, to match the performance of a local file system, because it provides so many more features, but from a pure performance point of view this is what you would expect a solution to look like, at least in terms of the shape of the curve.

Just for reference, this is the test the previous slide was based on. It is an FIO job file, for read and for write, and there are a couple of parameters we vary to change the I/O depth, the number of concurrent requests in flight: one is the number of FIO jobs, and for each FIO job there is an iodepth parameter which controls how many outstanding requests it can have at any time. That is possible because we are using the libaio I/O engine, which allows asynchronous I/O. A couple of other things: the block size is just 4K, which is the small random I/O I mentioned, and we are using direct I/O because a lot of these kinds of applications tend to use direct I/O.
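The slide itself is not reproduced here, but a job file along the lines described above would look roughly like this; the file size and runtime values are assumptions for illustration, not the original slide contents.

```
; Hypothetical reconstruction of the random-read job described above.
[global]
; libaio gives asynchronous I/O, so iodepth actually takes effect
ioengine=libaio
; O_DIRECT, as the target applications tend to use it
direct=1
; small random I/O
bs=4k
rw=randread
; per-job file size and runtime are assumed values
size=2g
runtime=60
time_based=1

; numjobs and iodepth are the two knobs varied along the x-axis
[randread]
numjobs=4
iodepth=16
```

Sweeping numjobs and iodepth upward until the device saturates reproduces the kind of curve shown on the slide; the write variant simply swaps rw=randread for rw=randwrite.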
Moving on: the configuration these numbers are based on is a fairly high-end Supermicro system. There isn't a whole lot of storage in it; we are particularly interested in performance on NVMe SSDs, the fastest SSDs, so it just has a single NVMe drive. We are looking at a fairly recent release of Gluster, plus some enhancements that we used to try out these performance improvements; they will eventually make it in, but they have not been merged yet. Gluster, as Krutika will explain, has a number of optional modules we call translators that you can plug in to add functionality to the SDS: translators that do read-ahead, write-behind, caching, and so on. In this particular case the volume has been tuned for random I/O, which means the strict-o-direct option is set, so that O_DIRECT is respected properly on the client side, and remote-dio is disabled, which means direct I/O is respected all the way down to the brick. Most of the read-ahead and caching translators are turned off in our test.

This is where we started, the baseline performance before most of the enhancements we're talking about were in place. The top line is the XFS read performance we talked about earlier, and at the bottom we have the same test run against Gluster, and you can see that it doesn't look too good. One problem is that as you increase the I/O depth, you are not able to scale up to what the device is capable of in terms of IOPS; you flatten out at an I/O depth of something like 16 to 32, and at that point the IOPS you are delivering is just a small fraction of what the device can do. We'll use this as a well-defined problem for the rest of the presentation: what can we do here, and what changes do we have to make to improve it?

A couple of things to mention. First, as I said, this is a specific workload, small random I/O. For sequential I/O the request path is somewhat different and you will see that Gluster performs well even with NVMe devices; typically you are able to deliver what the device, or the network, is capable of. Second, if you run this test on hard drives, the device capability is so low in terms of IOPS that you would probably not notice that Gluster's performance is not up to the mark, even compared to a local file system, because both will saturate the device quite quickly and deliver IOPS at that level.

There are also a couple of mitigating factors. As I said, there are new workloads that people want to run on solutions like Gluster, but right now most of them are not really interested in pushing the performance envelope; new things are coming in, but mostly applications where performance is not that sensitive. That is probably not going to stay that way for long: once people start migrating, the more performance-sensitive applications will come in as well, so we have some time to get our act together and fix these problems. The other thing is that in Kubernetes environments, you might know about Heketi and how it
serves out Gluster storage to Kubernetes clusters. Sometimes these devices get carved up into multiple Gluster volumes, each volume mounted in a different place, and when you split things up like that, the aggregate IOPS you can get might still be quite high. The numbers we are looking at here are basically a single client trying to push a single server as hard as possible. So there are some mitigating factors, but even so, this is an important problem for us to solve, and we will look at what worked and some of the lessons we learned. With that I will hand off to Krutika to continue with the Gluster architecture and some of the enhancements we made. If there are clarifying questions, I'll be happy to take them.

[Audience] Is this a single client connected to a single server? And what does the network look like?

Yes, this is a single client connected to a single server. In this particular case we have 10 gigabit Ethernet between the client and the server. This is 4K I/O, and with 10 GbE you can go up to about one gigabyte per second in terms of throughput; here we are still at less than 200 or so, I think, so the network is not the bottleneck. I should have put the network details on the slide.

[Audience] Can you define I/O depth once again? What is the I/O depth?

I/O depth is a term from FIO, which is a popular benchmarking tool. If you use an FIO job with libaio, the asynchronous I/O engine, the iodepth setting dictates how many outstanding I/Os the job can have at a time. If you have a job with iodepth equal to 16, FIO will allow 16 outstanding I/Os before it blocks further I/Os from being sent out. So it's a measure of the number of concurrent requests that the system will experience at any given point in time.

[Krutika] I'd like to spend a few minutes on what Gluster is and on the Gluster architecture, as I believe that's relevant to understanding the rest of the talk. So what is Gluster? Gluster is a scale-out, or distributed, storage system. It aggregates storage across multiple servers to provide a unified namespace. For example, if you have five servers, each with around one TB of storage attached, Gluster aggregates all of this into a single namespace of one TB times five, that is five TB, and your application's data gets distributed across those five disks in a way that is transparent to the user. Gluster has a modular and extensible architecture; we achieve this using what we call translators, which I'll talk about more in the next slides. Gluster is layered on on-disk file systems that support extended attributes, which means it can be layered on most of the on-disk file systems available today. And Gluster has a client-server architecture, where the client reads requests from the application, does some processing, and sends each request to the appropriate server; the server is the process that actually executes the system call the application requested and then returns the response.

Let's go over some Gluster terminology. First we have the brick.
A brick is the basic unit of storage: it is nothing but a file system that is exported from a server, so it has two parts, the machine it is exported from and the path to the file system that is exported. Then we have the servers, which are the machines that export your bricks; a group of servers is called a trusted storage pool in Gluster parlance. Then we have the volume: a volume is a namespace represented as a POSIX mount point, and it is basically a collection of bricks. It is the volume that is mounted on the client to access the data on the Gluster cluster. And then we have the translator. A translator is a stackable module with a very specific purpose; in the Gluster architecture all the different translators are stacked one on top of the other to form a graph. Each translator receives a request from its parent translator, does what it needs to do, and then sends the request down to its child. I'll explain that with a picture.

This is what a very simple Gluster translator stack looks like. On the left we have the client-side stack, on the right the server-side stack, and the two are connected by the network, represented by the dotted arrow in the center. At the top we have the fuse bridge translator, which is responsible for reading file system requests from /dev/fuse; it transforms each request and sends it down to its child translator. The operation flows from the fuse bridge all the way down through the translators in the middle until it reaches the client translator at the bottom of the client stack, which is the translator responsible for sending the file operation over the network. On the brick side the request is received by the server translator, which decodes the operation back into all the parameters of the file operation and sends it down to its child, and in this way the file operation flows all the way down to the POSIX translator, the module that actually executes your file system call. If your application requested a create, then it is POSIX at the bottom that executes the create system call to create your file on the on-disk file system that Gluster is layered on top of. Then the response flows all the way back up from POSIX to the server translator, over the network to client-0, and up to the fuse bridge, which writes the response back to /dev/fuse, from where it is returned to the application.

Now a bit about Gluster threads and their roles. First we have the fuse reader thread. The fuse reader thread is part of the fuse bridge module I mentioned at the top; it serves as the bridge between the fuse kernel module and the Gluster stack, translating your I/O requests from /dev/fuse into actual Gluster file operations, which we call FOPs. It sits at the top of the Gluster stack, and at the moment we have just one thread doing all of this. Then we have io-threads. io-threads is a thread pool implementation in Gluster; its threads process the file operations sent down by the translator above it. Essentially it maintains a queue where all the requests get enqueued, and a bunch of worker threads pick up these requests in parallel and start winding them down. The number of threads scales according to the load: if there is a lot of load, the full default number of threads will be active, otherwise it can scale down to as few threads as needed. So when there are parallel requests, io-threads helps wind all of them down in parallel.
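To make the io-threads idea concrete, here is a minimal, generic sketch of that pattern in C with pthreads: a mutex-protected queue plus worker threads that dequeue and process requests. It is only an illustration of the mechanism, not GlusterFS's actual io-threads code, and the names (request_t, iot_enqueue, and so on) are made up for the example.

```c
/* Sketch of an io-threads-style pool: a locked queue of requests plus
 * worker threads that pick them up in parallel. Illustrative only. */
#include <pthread.h>
#include <stdlib.h>

typedef struct request {
    struct request *next;
    void (*process)(struct request *req);   /* "wind the FOP down" */
} request_t;

static request_t *queue_head;
static request_t *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* Called by the translator above (e.g. the single fuse reader thread):
 * it only enqueues and returns, so the caller can go back to reading
 * the next request instead of carrying this one down the whole stack. */
void iot_enqueue(request_t *req)
{
    req->next = NULL;
    pthread_mutex_lock(&queue_lock);
    if (queue_tail)
        queue_tail->next = req;
    else
        queue_head = req;
    queue_tail = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Each worker dequeues a request and processes it; many workers running
 * in parallel is what lets concurrent FOPs proceed concurrently. */
static void *iot_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_cond, &queue_lock);
        request_t *req = queue_head;
        queue_head = req->next;
        if (!queue_head)
            queue_tail = NULL;
        pthread_mutex_unlock(&queue_lock);
        req->process(req);   /* wind the request down the stack */
    }
    return NULL;
}

/* Start a fixed number of workers; the real translator caps this at a
 * configurable maximum (64 by default, as mentioned in the talk). */
void iot_start(int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, iot_worker, NULL);
    }
}
```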
To see why this matters, consider what would happen if client-io-threads were not on the client stack: the single fuse reader thread would have to pick up each request, carry it all the way down to client-0, come back up, and only then pick up the next request. With client-io-threads, that long round trip is avoided; the request just gets queued at the client-io-threads layer and the reader goes straight back to reading the next request. We have io-threads on both the client and the server stack; on the server it sits quite close to the server translator and is labeled server-io-threads. We can have a maximum of 64 io-threads, and this is a configurable option in Gluster.

Then we have event threads. Event threads is also a thread pool implementation, but at the socket layer: it is responsible for reading requests from the socket between the client and the server. The thread count is again configurable, the default is two, and it again exists on both the client and the server, at the protocol/client and protocol/server translator level.

If you put all these threads into one picture, you see that at the top a single thread reads all the requests; parallel requests then get distributed at the client-io-threads level; when the requests flow to the server side, the server-io-threads can pick up multiple requests in parallel and wind them down; and the response processing can again happen in parallel at the protocol server and protocol client layers, where multiple threads receive responses over the network.

It might seem that with all these threads we should be getting good performance, but unfortunately that's not the case. What we found was that all this multithreading was sufficient to saturate the hardware when we were using spinning disks in the bricks, but with NVMe drives the hardware was far from saturated, so we set out to understand why. As part of this we had to figure out whether the bottleneck was on the client side or the brick side. To do this, the same FIO job was executed from two different Gluster clients against a single-brick volume, to see whether the number of IOPS increased. We went from 30,000 IOPS to 60,000 IOPS, which meant that the brick was able to deliver those requests and it was a single client that was not able to send enough of them. So we decided to concentrate on fixing the bottleneck on the client side.

One thing that was very clear from this experiment was that with multiple threads and a lot of global data structures, there is bound to be lock contention, which can slow down performance. To debug the lock contention we used a tool called mutrace. mutrace is a mutex profiler used to track down lock contention; it provides a breakdown of the most contended mutexes and gives useful information like which were the top ten most contended mutex locks, how often each lock was taken, how long over the entire runtime the mutex was held, and so on. We ran mutrace on the client side; I'll come back to the results a bit later.

I'd also like to mention the debugging tools that already exist in Gluster today for performance work. There is the volume profile command, which provides per-brick I/O statistics for each file operation.
This includes information like the number of calls and the minimum, maximum and average latency for every file operation. It is implemented in a translator called io-stats, and we loaded this io-stats translator at multiple places in the stack, by hacking the volume file, to see the latency between pairs of translators. This experiment indicated that the highest latency was between the bottom-most translator on the client stack and the top-most translator on the server stack, that is, the protocol/client and protocol/server translators, which operate at the network level.

Now a bit about the enhancements we made in the process. First, fuse event history. This is something that was detected by the mutrace tool: it appeared at the top of the list of most contended locks. The fuse bridge maintains a history of the most recent 1K operations it has performed, in a circular buffer; it records every FOP in both the request and the response path. The problem with this data structure is that it is protected by a single mutex lock, and this was causing a lot of contention between the fuse reader thread, operating in the request path, and the client event threads, operating in the response path. We fixed this by disabling the feature by default, since it is not of much use unless you have to debug an error or some file system issue. With event history disabled, the random read IOPS improved by around 10K and the random write IOPS improved by around 15,000.

Next, there was another RPC layer fix that came in at this point, which Raghavendra will talk about a bit later. When we tested with the combination of these two patches, we found that at some point the fuse reader thread started consuming almost 100% of a CPU, which meant we could not proceed further unless we fixed that. To address it, we added more reader threads to process requests from /dev/fuse in parallel. The impact of this fix was an increase of around 8,000 IOPS with four reader threads.

Then it was time to use mutrace again to see where the next bottleneck was. With an I/O depth of 16 we saw that mutrace was reporting the iobuf pool lock at the top of the list; we then increased the I/O depth to 32 to see whether the contention indeed increased, and it did: the contention time almost doubled. This meant we had to fix the iobuf bottleneck. What exactly is an iobuf? An iobuf is the data structure used to pass read and write buffers between the client and the server. It is implemented as a preallocated pool of iobufs, as with most data structures in GlusterFS, to avoid the cost of doing a malloc or a free every time. Again, this is a single global iobuf pool protected by a mutex lock, which meant that both the fuse reader thread and the client event threads contended on the same lock to allocate or deallocate an iobuf. To fix this, we began with a quick fix to see what the impact of reducing this contention would be: we created multiple iobuf pools in the code, and for each iobuf allocation request we changed the code so that it would select one pool at random, or using a round-robin policy. The lock effectively got striped: instead of all threads contending on one lock, the contention was now distributed across multiple pools.
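As a rough illustration of that quick fix, here is a minimal sketch of lock striping over a preallocated buffer pool. The structure and names (iobuf_t, NUM_POOLS, and so on) are invented for the example and do not reflect the real GlusterFS iobuf code.

```c
/* Sketch of striping one contended pool lock across several pools,
 * chosen round-robin. Illustrative only, not the real iobuf code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NUM_POOLS 8            /* more pools means less contention per lock */

typedef struct iobuf {
    struct iobuf *next;
    int pool_id;               /* which pool this buffer is returned to */
    char data[4096];
} iobuf_t;

typedef struct {
    pthread_mutex_t lock;
    iobuf_t *free_list;        /* preallocated buffers */
} iobuf_pool_t;

static iobuf_pool_t pools[NUM_POOLS];
static atomic_uint next_pool;  /* round-robin counter */

void iobuf_pools_init(int bufs_per_pool)
{
    for (int i = 0; i < NUM_POOLS; i++) {
        pthread_mutex_init(&pools[i].lock, NULL);
        for (int j = 0; j < bufs_per_pool; j++) {
            iobuf_t *buf = calloc(1, sizeof(*buf));
            buf->pool_id = i;
            buf->next = pools[i].free_list;
            pools[i].free_list = buf;
        }
    }
}

iobuf_t *iobuf_get(void)
{
    /* Pick a pool round-robin instead of hammering one global lock. */
    unsigned id = atomic_fetch_add(&next_pool, 1) % NUM_POOLS;
    iobuf_pool_t *p = &pools[id];

    pthread_mutex_lock(&p->lock);
    iobuf_t *buf = p->free_list;
    if (buf)
        p->free_list = buf->next;
    pthread_mutex_unlock(&p->lock);
    return buf;                /* NULL if this pool happens to be empty */
}

void iobuf_put(iobuf_t *buf)
{
    iobuf_pool_t *p = &pools[buf->pool_id];
    pthread_mutex_lock(&p->lock);
    buf->next = p->free_list;
    p->free_list = buf;
    pthread_mutex_unlock(&p->lock);
}
```

Whether buffers are spread at random or round-robin matters less than the number of independent locks: each extra pool cuts the number of threads fighting over any single mutex.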
More pools means less contention. With this fix we ran the same FIO test again and saw that the random read IOPS improved by about 4K and the random write IOPS by around 10K. That might not seem like a lot of improvement, but it was vital for making further progress. Okay, so I'll now let Raghavendra talk about some of the RPC layer improvements.

[Raghavendra] Good evening, everyone. The RPC layer is basically Gluster's own custom implementation of the Sun RPC specification. For every socket connection between a client and a server, this RPC library is loaded on the client as well as on the brick. Since the sockets are non-blocking, we use an epoll-based polling mechanism, and there, as Krutika mentioned, we have a thread pool which we call event threads. It is global for the entire process, in the sense that if there is a DHT client speaking to five bricks, that one thread pool reads the responses from all five bricks and sends the requests to all five bricks.

As Krutika mentioned earlier, the Gluster volume profile information showed high latency between protocol/client and protocol/server, which basically means that something is not right in the RPC layer; it could also mean there are issues with our implementation of sockets, or rather with how we are using the sockets for communication. The first thing that occurred to us was: in the request submission path we acquire a certain lock, and in the reply path we acquire the same lock, so requests and replies might be contending on each other and thereby slowing down performance. Unfortunately, when we made the fixes and ran the test again, there were no gains. This is one of the key things we faced in debugging this performance issue: we spent quite a lot of time on it, but it turned out not to be much of an improvement.

In the meantime there was an unrelated problem, a performance problem in the erasure coding translator. There were bugs complaining that performance was not good with the erasure-coded implementation, and as you may be aware, there is a good deal of processing in the write code path for erasure coding: you have to compute all those matrices and a lot of other stuff. There we had made a fix and it had given a significant boost in performance; if I quote Manoj, it was probably a three to four times performance improvement at the time, and that enhancement was already present in the 3.13.1 baseline we are evaluating for this exercise.

To recap how this polling model works: we get an event that there is an incoming message, and since the epoll mechanism we use is a one-shot mechanism, until we add the socket back for polling by calling the epoll_ctl system call, we will not receive any more events from that socket. That means any messages arriving on the socket will be waiting until we add the socket back for polling. What the fix I just mentioned did was reduce the time window during which a particular socket was out of polling for incoming messages. Since that had given a significant boost, that was one hint: maybe the bottleneck was in reading messages from the socket.
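For readers who have not used one-shot epoll, here is a minimal sketch of the pattern being described: with EPOLLONESHOT, a socket delivers no further events after one is handled until it is explicitly re-armed with epoll_ctl, so any delay before re-arming is a window in which incoming messages sit unread in the kernel. This is a generic illustration of the mechanism, not the GlusterFS socket code; handle_incoming is a stand-in for real message processing.

```c
/* Minimal one-shot epoll loop: after handling an event, the fd must be
 * re-armed with EPOLL_CTL_MOD before any further events are delivered.
 * Generic illustration of the pattern discussed above, not Gluster code. */
#include <sys/epoll.h>
#include <unistd.h>

/* Register a non-blocking socket in one-shot mode. */
void watch_socket(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Stand-in for reading and processing one RPC message. */
static void handle_incoming(int fd)
{
    char buf[4096];
    while (read(fd, buf, sizeof(buf)) > 0)
        ;   /* drain whatever is currently readable */
}

void event_loop(int epfd)
{
    struct epoll_event ev;

    for (;;) {
        int n = epoll_wait(epfd, &ev, 1, -1);
        if (n <= 0)
            continue;

        int fd = ev.data.fd;

        /* From here until EPOLL_CTL_MOD below, this socket is "out of
         * polling": new messages queue up in the kernel but generate no
         * events. Shrinking this window is what the fix described in the
         * talk was about. */
        handle_incoming(fd);

        /* Re-arm the fd so the next message generates an event again. */
        struct epoll_event rearm = {
            .events = EPOLLIN | EPOLLONESHOT,
            .data.fd = fd,
        };
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm);
    }
}
```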
While we were discussing that problem, we thought of validating this hypothesis with a test. All along we had been testing with a single-brick, single-client model; now we tried scaling to a three-brick distribute model, so that there would be more connections, and luckily for us it gave performance benefits. That pointed to the single connection being a bottleneck and to the need for more concurrency there. One of our colleagues suggested that we could have multiple connections between the same client and a brick; until then we had only one connection, and that multiple-connection work gave us the same benefits as the three-brick distribute experiment.

So to recap, these are the improvements we made: first, we disabled the event history buffer; second, we added concurrency in reading requests from the fuse kernel module by scaling up the fuse reader threads; third, we reduced the lock contention on the iobuf pools; and fourth, we added more connections between a single brick and the client. After all these enhancements, the random read IOPS peaked at around 70K, compared to approximately 30K when we started, and the random writes saturated at around 80K IOPS, compared with 40K earlier. One point to note is that this is very much a work in progress: we are expecting further improvements, and after this conference we are going to continue carrying this work forward.

Now some lessons learned. Having been a developer for most of my life, this exercise gave me some insight into how people debug performance problems. One thing: when Manoj and Krutika showed me the lock contention output from mutrace, I could see there were lots of locks that were highly contended. Now, should I fix all of them? In a typical piece of software like GlusterFS you would expect tens, maybe hundreds, of locks in different code paths. Should I go after every contended lock? How do I tell apart the locks that are actually bringing down performance from other highly contended locks that are not affecting performance, at least at that particular point in time? Thanks to Manoj, who pointed out that we should collect multiple data sets by altering the parallelism. This is what the graph looks like: we are considering two locks here, the iobuf pool lock and the memory pool lock (the memory pool is again a pool of very commonly allocated objects). As you can see, the iobuf lock is the most heavily contended, and in second place is the memory pool lock. There are two parameters here: the numbers, 6 plus 6, represent the client event threads and the fuse reader threads. When we scaled that from 6 plus 6 to 12 plus 12, we saw the iobuf lock contention growing, but not the memory pool lock; its contention did not grow even though we increased the scale. That points to the iobuf lock being the one holding back performance at this point in time, and the memory pool lock not being the one. That's how we arrived at the decision to fix the iobuf lock and go ahead with the test.

The next lesson is that under highly concurrent loads, multiple threads are necessary even for a very lightweight task. If you look at it, the work done by the fuse reader thread is very small: it basically reads the
request from /dev/fuse and queues it to io-threads. The initial assumption was that since it is not doing much, it could keep up with that concurrency, but that turned out to be wrong: we got performance improvements after scaling up the fuse reader threads.

The next difficulty we faced was that, while collecting information about lock contention, mutrace itself added some overhead, and that potentially skewed the bottleneck information.

Over the course of this work we also found that there are multiple bottlenecks, and sometimes you fix one but do not see the results, because there are other bottlenecks that need to be fixed first before the results of your enhancement become visible. So for every fix we make, we have to go back and re-test all the enhancements we had discarded, and see whether they become effective at this point. The best example is the three-brick distribute test I mentioned: we had done that even before disabling the event history, and at that point it did not give a performance improvement. Once we disabled the event history, the same test gave better results, because earlier the bottleneck was the event history; once we removed that, we could see the other enhancement also giving results. You will also have noticed that the gains were small: there was no single big fix that gave us 25K IOPS or something like that; it was small gains adding up to a significant number.

Simple tools, like the sysstat utilities and top, gave good insights; for example, if you observe that the CPU utilization of a thread is peaking and reaching a hundred percent, it's probably time to add more threads. We also had our share of micro-optimizations that did not pay off; one was the effort to add more concurrency between request submission and reply reading in the RPC layer. And stepping back, taking a high-level view of the architecture, and coming up with a model like the three-brick distribute test helped us validate a hypothesis even before trying out the fix. The point is that you don't have to come up with fixes right away if you have correct models.

As for future work: as we have been pointing out, this is very much a work in progress. Bottleneck analysis on the client and the brick stacks is not complete yet; the work so far has concentrated on the client. Another interesting thing we observed is that when we scaled up the fuse reader threads, significant CPU cycles were being consumed in a spinlock that is acquired in the read code path of /dev/fuse; we need to figure out what that is and why it is consuming that many CPU cycles. The lock contention work, especially on the mutexes, is not complete either: the next lock we see is the inode table lock, and it is contended quite heavily. Of course, we also need more lightweight tracing tools, and we probably need to do some work there, whether improving the existing mutrace or writing a new tool; we don't know yet. Since we encountered some inefficiency in the RPC library, we could also evaluate other RPC libraries, such as gRPC, to see whether they do better. And there is the zero-copy idea, which is basically to use splice so that you don't copy the read and write payloads between kernel buffers and user buffers. Since GlusterFS is a user-space file system, when you read a request from the kernel and then write the same data out to the network to transfer it to the brick, there is back-and-forth copying between user space and kernel space, and splice helps avoid that.
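As a rough sketch of what splice-based zero copy looks like in general (not Gluster's planned implementation), data can be moved from one file descriptor to another through a pipe without ever being copied into a user-space buffer:

```c
/* Generic splice-based forwarding: move 'len' bytes from in_fd to out_fd
 * through a pipe, without copying the payload into user space.
 * Illustration of the zero-copy idea only, not GlusterFS code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t forward_zero_copy(int in_fd, int out_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* Pull data from the source fd into the pipe; it stays in the kernel. */
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;

        /* Push the same pages from the pipe out to the destination fd. */
        ssize_t in_pipe = n;
        while (in_pipe > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_fd, NULL, (size_t)in_pipe,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                goto out;
            in_pipe -= m;
            total += m;
        }
        len -= (size_t)n;
    }

out:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```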
There is work in progress on that, and we need to see how much it helps. And of course all this work has to be merged into master and shipped, which in itself will take considerable effort; the effort can be tracked through the Bugzilla link given below. Any questions?

[Audience] Thank you, all three of you, I enjoyed the talk. I have a more generic question about GlusterFS: I was wondering how elastic it is these days. How well does it handle bricks and nodes coming and going? Do you get dips in your graphs, or is it not an issue? How highly available is GlusterFS, and can it deal well with bricks and nodes failing?

[Raghavendra] Basically we have two high-availability-related translators. One is AFR, which we call replicate: automatic file replication, the synchronous replication module. If some bricks go down but a quorum of bricks is still up, it can still service I/O. One common complaint with automatic file replication is that, because it is a mirroring solution, a three-way replica mirrors the data on all three nodes, so naturally people complained about the wasted storage space. The next solution we have is erasure coding, which again brings high availability to the stack. So that's how GlusterFS handles node shutdown scenarios.

[Audience] You are running FIO against a single mount point; have you tried mounting the same export multiple times and running FIO across multiple instances? You would get multiple fuse instances that way, with multiple directories and the same volume mounted at multiple mount points. Does it scale better that way?

[Manoj] That's related to what I was saying earlier, I don't know if you caught it, about the mitigating factors. In environments like Kubernetes, things like that happen: multiple volumes get mounted at multiple mount points. The same volume being mounted at multiple mount points on the same client tends to sometimes distort the picture, in terms of the free space the client sees and things like that. Those might be workarounds, and they might work for some specific users, but I think it's important to solve the basic problems to the extent that you can.

We should not stop people who need to catch a train or a bus from leaving, so if you still have questions, come and ask the speakers at the podium, and let's give them a big hand for the presentation.