Hi everyone, we are glad to introduce our block storage service built on Ceph and OpenStack. We spent two years building the service and we have a lot of experience to share with you. My name is Yong Zhe, and this is Haomai.

The content has four parts. First, I will introduce our block storage service features and our minimal deployment architecture, and I will also show how we integrate OpenStack with Ceph. Second, Haomai will introduce how we achieve high performance; we optimized the whole IO stack. Third, I will introduce how to design the Ceph CRUSH map for high durability; we have run several Ceph clusters over the past year. Last, I will share our experience with Ceph operations.

First part: the block storage service. Our block storage service features high performance and strong durability. We provide real-time snapshots and two volume types.

How do we build the block storage service? This table lists all the software versions used in the service. We keep them updated, because we are watching upstream to get new features and higher performance.

Next I will show our minimal deployment architecture. The minimal deployment has twelve OSD nodes and three monitor nodes. Each node has two 10 GbE networks: one is used for the storage network, the other for the VM network. This architecture is very easy to scale out: you can add twelve OSD nodes at a time to expand the capacity, and you can also schedule VMs onto the new nodes.

Then we focus on OpenStack and how to integrate Ceph with it. The native OpenStack setup has two disadvantages: slow instance creation and boot storms, because nova-compute has to download the whole image over HTTP. How do we optimize it? The method is to use Ceph pools. When we create a new instance, nova-compute no longer downloads the whole image over HTTP; the instance reads the image data directly from Ceph over TCP, and it only needs to read a small amount of data to boot, so it is very fast.

As we can see, QoS is very important for cloud tenants, so we backported the QEMU IO throttling feature into our QEMU 1.2 version. We use Cinder multi-backend to provide two volume types: the performance volume type lives in the Ceph SSD pool, and the capacity volume type lives in the SATA pool. We also use Cinder multi-attach to provide shared volumes.
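To make the multi-backend part concrete, here is a rough sketch of how two volume types can be mapped to two Ceph pools in Cinder; the backend names, pool names, and type names are only illustrative, not our exact production values:

    # cinder.conf (illustrative excerpt)
    [DEFAULT]
    enabled_backends = ceph-ssd,ceph-sata

    [ceph-ssd]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes-ssd
    volume_backend_name = ceph-ssd

    [ceph-sata]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes-sata
    volume_backend_name = ceph-sata

    # map the two volume types to the backends
    cinder type-create performance
    cinder type-key performance set volume_backend_name=ceph-ssd
    cinder type-create capacity
    cinder type-key capacity set volume_backend_name=ceph-sata

Per-tenant IOPS and bandwidth limits can then be layered on top with Cinder QoS specs (cinder qos-create plus qos-associate), which the backported QEMU IO throttling enforces on the hypervisor side.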
OK, that was the overview of our block storage service. Next we will focus on how to achieve a high-performance block storage service. Before we dive into Ceph itself and how to improve it, we need to go back to the OS layer, because Ceph is a distributed storage service and it depends on many other things: the OS, the network, and the HDD or SSD hardware. So before tuning Ceph we need an overview of the OS configuration.

The first part is the CPU. As we know, more and more CPU models try to save power for environmental and cost reasons, so they default to a power-saving mode. If we want to get more performance out of the CPU, we need to switch it from the power-saving mode to the performance mode. We also introduce cgroups here: we combine the storage node and the compute node into one box, so to make full use of the CPU we need to bind each OSD process to fixed cores. In our production environment we bind one or two cores per OSD.

The second part is memory. If your system supports NUMA, I suggest you just turn it off, because Ceph cannot make good use of it and it may degrade performance. Next, set vm.swappiness to zero: as you know, if page-in and page-out happen to the OSD process, Ceph performs very poorly.

The next two parts are the block device and the filesystem. We set the deadline scheduler on the block devices, and we mount the filesystem with noatime and nobarrier. What do noatime and nobarrier mean? Ceph stores its objects on a local filesystem, and by default the filesystem records the access time in the file metadata. Ceph does not use it at all and it is really harmful to performance, so we disable it with noatime. nobarrier turns off a durability feature of the filesystem; because we have the Ceph journal, we can disable it.
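As a minimal sketch of these host-level settings (device names, core numbers, and paths are examples only; adjust them to your own hardware):

    # CPU: switch from the power-saving governor to performance
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > $g
    done

    # Pin one OSD to two fixed cores with a cpuset cgroup (one group per OSD)
    mkdir -p /sys/fs/cgroup/cpuset/osd.0
    echo 0-1 > /sys/fs/cgroup/cpuset/osd.0/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/osd.0/cpuset.mems
    echo <osd-pid> > /sys/fs/cgroup/cpuset/osd.0/tasks

    # Memory: never swap the OSD processes out
    sysctl vm.swappiness=0

    # Block device: deadline scheduler for the OSD disks
    echo deadline > /sys/block/sdb/queue/scheduler

    # Filesystem: mount the OSD data partition with noatime and nobarrier
    mount -o noatime,nobarrier /dev/sdb1 /var/lib/ceph/osd/ceph-0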
Next we will introduce our QEMU backports and enhanced features. Mainly we backported features like IO throttling and rate limiting, discard and flush request enhancements, IO burst support, and multi-queue support. The multi-queue support is mainly used by QEMU so we can get higher IOPS for QEMU and the VM.

Next we will go down into the Ceph IO stack. The bottom picture gives an overview of how an RBD image is used by a VM. Each image consists of many objects; RADOS, Ceph's object store, keeps each object as a file in the local filesystem. So you can regard an RBD image as many files spread across a distributed system, and if we want to improve RBD image performance, we need to improve the local filesystem or improve how the Ceph OSD makes use of it.

The top picture shows how one IO request flows from the QEMU thread into the Ceph OSD and is then stored in the local filesystem. The picture may not be very clear, but what it wants to show is that one IO request has to flow through many threads: the network thread receives the IO request and dispatches it to the OSD op thread, which dispatches it again, and so on. So one IO request flows through many threads, and in order to improve the performance of Ceph, we need to reduce the overhead of the context switches behind this dispatching.

We have seven points on how to reduce it.

The first point is to keep FDs. An FD, a file descriptor, is the OS handle for a file in the local filesystem; the Ceph OSD obtains a file descriptor to read or write the object data. Why do we need a bigger FD cache? Because system calls such as open and close consume a lot of time for a normal IO. You can imagine that with a lot of data we need to open many files, close some, open others, and this happens repeatedly. We need to avoid it in an all-SSD Ceph cluster, so we increase the FD cache and the omap header cache to a very large size that can hold all the hot objects; the filestore omap header cache size and filestore FD cache size options control this. Besides this, we also change the default RBD object size: with larger objects there are fewer objects and fewer FDs to hold. The default object size is 4 MB; we increase it to 16 MB, and we configure OpenStack Cinder so it can create volumes with a different object size.

The second point is sparse read and write. As I mentioned above, an object is a file in the local filesystem. When an application in a VM writes a range of data to a volume, an object often contains only a few kilobytes of real data, so sparse objects are really helpful for SSD capacity. But when we create a snapshot, or OSD recovery happens, the whole object is copied and the holes are filled in at the new location. When that happens it is really harmful to the normal IO path: if you create a snapshot of a volume and then write data to it, you will see very bad performance from the VM. So why don't we simply enable the filestore fiemap option? Because there are existing bugs in the kernel XFS code and in Ceph, and we disabled it to avoid data corruption. If you have made sure your kernel XFS has the bug fixed, or you have another way around it, you can enable it and get a big performance improvement when you create snapshots or when OSD recovery happens.

The third point is to raise the default throttle limits. Ceph is mainly designed for HDD disks, and HDDs are much slower than SSDs, so many configuration values need to be changed, for example the filestore wbthrottle, the filestore queue, the journal queue, and the related recovery and scrub settings; we raise them roughly ten times compared to the defaults.
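Putting the first three points into configuration form, a hedged ceph.conf sketch looks roughly like this; the option names are the FileStore-era ones and the values are only examples, since we tuned them per cluster:

    [osd]
    # point 1: keep FDs and omap headers cached to avoid repeated open()/close()
    filestore_fd_cache_size = 32768
    filestore_omap_header_cache_size = 32768

    # point 2: sparse read/write relies on fiemap; only enable it once you are
    # sure your kernel XFS has the corruption bug fixed
    filestore_fiemap = true

    # point 3: raise the HDD-oriented queue and throttle defaults for SSDs
    # (the filestore_wbthrottle_* options can be raised in the same spirit)
    filestore_queue_max_ops = 5000
    filestore_queue_max_bytes = 1048576000
    journal_max_write_entries = 1000
    journal_max_write_bytes = 1048576000
    journal_queue_max_ops = 5000
    journal_queue_max_bytes = 1048576000

    # recovery and scrub limits can also be raised (we went roughly 10x) on SSDs
    osd_max_backfills = 10
    osd_recovery_max_active = 30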
The fourth point is the read-ahead cache. I think many of you know it; it gives a remarkable performance improvement for sequential reads and writes. You can just enable it, and it works well.

The fifth point is the cache. We know the cache plays a very important role in a storage system, so how do we make use of it? As I mentioned above, Ceph by default uses a very small cache size because it was designed for HDD disks. If you run on SSDs you want a much bigger cache, so you need to increase the cache size; but the default cache implementation is not suitable for a large cache size and is not very effective. We need to change the default cache implementation to RandomCache, which is another cache in the Ceph source, but it needs some more coding. So we implemented our new, more effective cache to replace the default one.

The sixth point is to keep threads running. As I mentioned above, an IO flows through many threads: when the network thread receives an IO request it has to wake up the next thread, and that thread has to wake up the next one. Threads frequently wake up and go back to sleep, and a lot of CPU time is consumed by this. To reduce this overhead we need to reduce the time spent sleeping and waking, so we wrote a patch for it: you can set an OSD op worker wait time, which controls how long a worker thread keeps polling for new work before it goes to sleep. We get a much better performance result from this patch on an SSD backend.

The final point is the async messenger. It is an experimental feature and we are still doing more coding on it, but if you want to build your own SSD Ceph cluster and deploy it in the coming months, you can keep an eye on it; it gives much better latency for normal IO. Why does it improve performance? The original Ceph network layer uses two threads for each client connection. As the number of clients grows, especially in a cloud where many VMs attach to and access the OSDs, many threads are created on the OSD side, and the context-switch cost becomes very high. To reduce it, we introduced the kernel event notification interfaces, for example epoll on Linux and kqueue on BSD systems.

Next, the IOPS results. It is a very simple test based on Ceph Dumpling; we know Firefly has been released, but it was not tested, so we just show the Dumpling release, with one OSD, one client, an IO depth of 16, and of course a replication size of one, so you can scale the numbers yourself. The left picture is the original Dumpling release: nearly 1,600 IOPS. The right picture is our patched version: we achieve 10,000 IOPS. The SSD is an Intel SSD, so the raw hardware can do roughly 20,000 to 30,000 IOPS; considering Ceph's double write and some metadata overhead, this means one OSD can make nearly full use of the SSD in our cluster. For the master branch of Ceph, I remember it gets 3,000 or 4,000 IOPS, so there is still a big gap between it and our SSD-specialized branch. As for latency, you can see from the lines that we get about one millisecond for 4K reads and writes on a 1 TB image, and around 500 microseconds for read ops. Compared with the master branch this is a large improvement, with very good tail latency.

OK, in this part I will introduce how to achieve high durability with Ceph. You need some knowledge about Ceph for this. There are many documents about the Ceph reliability model, and after reading them we derived a durability formula. To save time I will not explain the reliability model and the durability formula in detail; they are in the last slides.

What do we need to optimize? As you know, in a distributed system the data placement decides the durability; in Ceph the CRUSH map decides the data placement, so the CRUSH map decides the durability. The default CRUSH map setting is not good for us; we need a new CRUSH map setting to get higher durability.

The conclusions from the Ceph reliability model are that durability depends on the OSD recovery time and on the number of copy sets in the Ceph pool. What is a copy set? It is a possible OSD set for a PG. Data loss in Ceph is the loss of any PG, which is really the loss of any copy set: if the replication number is three and we lose three OSDs, the probability of data loss depends on the number of copy sets, because those three OSDs may happen to form a copy set. So we know that the shorter the recovery time, the higher the durability, and the fewer copy sets there are, the higher the durability.
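The exact formula is in the appendix slides; as a rough, simplified sketch (assuming independent OSD failures and ignoring constant factors), the probability of data loss behaves like

    P_{\text{loss}} \;\propto\; M \cdot \left(\lambda \, t_{r}\right)^{R-1}

where M is the number of copy sets in the pool, λ the per-OSD failure rate, t_r the recovery time, and R the replication number. This is only an illustration of the dependence, not the model from the slides, but it shows why both shortening recovery and reducing copy sets raise durability.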
How do we optimize it? The CRUSH map setting decides the recovery time, and the CRUSH map setting decides the number of copy sets, so we change the CRUSH map setting to get higher durability.

First, let's compute the durability with the default CRUSH map setting. In this case we have three racks and each rack has 24 OSDs. If the replication number is three, the number of copy sets is 24 × 24 × 24 = 13,824, which is a big number. You can use the command "ceph osd tree" to see the CRUSH map; this is the CRUSH setting in this case. With a replication number of three, the durability is about eight nines: better than RAID 5 and RAID 1, and roughly equal to RAID 6. But that is not good enough for us, because the cloud runs at a larger scale and we need higher durability.

So let's reduce the recovery time. In the default CRUSH map setting a host bucket has three OSDs. If one OSD goes out, only two OSDs can do the data recovery, so the recovery time is too long. We need more OSDs to take part in the recovery to make it shorter, but we cannot add more OSDs to a host bucket, because a host has a network bandwidth limit and a disk slot limit. So we add a new bucket, instead of the host bucket, to contain the OSDs. The new bucket is called osd-domain; it is a logical bucket, unlike host and rack. We can put many more OSDs in an osd-domain, so the recovery time becomes very short. In the diagram, the green wireframe is the osd-domain bucket; the dashed line indicates that it is a logical bucket.

We added the new bucket, but we have not changed the number of copy sets. In the "ceph osd tree" output there is a difference: we use the osd-domain bucket instead of the host bucket, and each osd-domain has twelve OSDs. The new CRUSH map reduces the recovery time; the durability is nine nines. Compared with the default CRUSH map setting, we improve the durability ten times.

Next, how do we reduce the number of copy sets? The method is to add a replica-domain bucket. In the picture, the blue wireframe is the replica-domain. A PG only resides inside one replica-domain; it cannot cross replica-domains, and this greatly reduces the number of copy sets. You can see in the "ceph osd tree" output that we use the replica-domain bucket instead of the rack bucket. We also change the CRUSH rule: in the rule we first pick one replica-domain, and then choose the OSDs inside it. The new CRUSH map reduces the number of copy sets; the durability is ten nines. Compared with the default CRUSH map setting, we improve the durability one hundred times.
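To show what this looks like in practice, here is a trimmed, illustrative sketch of a decompiled CRUSH map with the two logical bucket types and the modified rule; the names, IDs, weights, and bucket counts are examples rather than our exact production map:

    # extra logical bucket types between osd and root
    type 0 osd
    type 1 osd-domain        # recovery unit, used instead of host
    type 2 replica-domain    # a PG never crosses this boundary
    type 3 root

    osd-domain osd-domain-0 {
        id -1
        alg straw
        hash 0
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        # ... twelve OSDs per osd-domain
    }

    replica-domain replica-domain-0 {
        id -11
        alg straw
        hash 0
        item osd-domain-0 weight 12.000
        item osd-domain-1 weight 12.000
        item osd-domain-2 weight 12.000
    }

    root default {
        id -20
        alg straw
        hash 0
        item replica-domain-0 weight 36.000
        item replica-domain-1 weight 36.000
    }

    rule replicated-ssd {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 1 type replica-domain    # pin the PG to one replica-domain
        step chooseleaf firstn 0 type osd-domain    # spread replicas across osd-domains
        step emit
    }

With this layout the three replicas of a PG always land in three different osd-domains inside a single replica-domain, which is what cuts down the number of copy sets. The map can be edited offline with crushtool (decompile, edit, recompile) and injected with "ceph osd setcrushmap".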
For the last part, I will share our operation experience. We use Puppet to deploy our Ceph cluster. Our puppet-ceph module is different from the existing community modules and has several advantages, such as a shorter deployment time, support for all Ceph configuration options, and more.

Our operation goal is availability: we need to reduce unnecessary data migration and reduce slow requests. When we upgrade Ceph, before restarting a Ceph OSD process we first mark the OSD down; this reduces slow requests. Before we reboot a host we also mark its OSDs down, which again reduces slow requests. When we expand the Ceph capacity, we need to set the CRUSH map and trigger the data migration in the middle of the night. When we hit a disk corruption we need to replace the disk; the most important thing is to make sure the replica-domain weight does not change, which avoids unnecessary data migration.

For monitoring, we use Diamond to collect the performance metrics and the Ceph cluster status, Graphite to store the data, and Grafana to display it. Ceph alerting is based on Zabbix. Following the Ceph architecture, we built a throttle model of how a write request passes through all the throttle layers. Based on that model we added new collectors to Diamond and defined the metric names in Graphite, as you can see; these are the metrics in Graphite. The Graphite UI is ugly; the picture shows the FileStore journal latency. This is the Grafana dashboard: the first panel is the IOPS, the second is the bandwidth, the third is the OSD journal latency, and the fourth is the read request latency.

In the past year we met many accidents. The most dangerous one was an XFS bug; that bug can bring down all the OSDs.

Thank you for listening. The session was short, so maybe you have many questions. Anyone? OK, no problem. Any other questions about the performance, the CRUSH map, or anything else?

Audience: From a hardware perspective, are you using traditional 1U nodes with SSDs, or are you looking at any of the exotic systems that have a much higher drive count, you know, 30 in 1U or something? Just curious what you built with.

Speaker: Sorry, I did not quite catch your meaning. OK: the servers are 1U with SSDs, three SSDs per server, and each rack has 18 servers. Each server is 1U and has three SSDs, all on 10 GbE networking.