Can we start? Can you hear me? Is that better? Okay. Actually I'm a little surprised — this is the last session and there are still so many people here, so hopefully I can say something useful and not disappoint you. Today I will talk about Swift and Ceph performance. I'll keep the introduction very short to save time, then cover Ceph first, then Swift, and end with a summary. I will not compare Swift to Ceph apples to apples, because today in OpenStack, Ceph is better known for the block service and Swift for the object service, so I will look at each from its own angle.

I'm from Shanghai and I work for Intel. We have a team there working on cloud technology, and we started looking at OpenStack performance two years ago. This is my third conference; I also gave a talk about our benchmark, COSBench, at the San Diego OpenStack conference. All of the content here is still work in progress, so there may be mistakes, and we would be glad to get your comments. I also want to emphasize that this is teamwork: we produced a lot of data across many environments, and I want to thank the people from Intel and SwiftStack who looked at what we were seeing, gave us comments and hints, and helped us adjust things so they are reasonable. I'm also trying to blog the details of this talk, because I only have 40 minutes and we have so much data that I cannot cover everything here; there is a blog link if you want more detail.

Okay, let's look at the Ceph part. Here is our testing environment. In general we measured RBD, i.e. block-level performance. We have four storage nodes, and all the networking is 10 GbE, so we can be sure there is no network bottleneck. Each node has one processor, 16 GB of memory, and ten 1 TB SATA disks connected to an LSI HBA in JBOD mode, with each disk serving one OSD. We also have three SSDs per node used as the journal — if you know Ceph, it has a dedicated journal design that makes writes run faster and supports snapshotting — and those SSDs are attached to the on-board SATA controller on the host.

This is the software configuration. For all of this testing we stuck with Ubuntu. Our methodology is: we have an OpenStack environment, we start a number of virtual machines, and inside each VM we run a simple workload generator, fio, with different workload patterns. I list the exact kernel version and QEMU version here because the performance really depends on them. On the network side we enabled jumbo frames to get better sequential IO.
For the filesystem we use XFS. We tried Btrfs before, but some of our customers told us they want XFS because it is more stable, so we use XFS here, and we also did some Ceph tuning to get better results.

There are several ways to test Ceph performance: you can test directly from the host, or from inside a VM. We tried to take the customer's point of view: if people really want to use Ceph, we think the most common usage model is something like Amazon EBS — they attach a volume to their virtual machine and run their workload and all the IO from inside the VM. So here we use RBD, going through the QEMU RBD driver, and the workload generator is fio with four different patterns: sequential read and write with a 64 KB block size, and random read and write with 4 KB.
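Below is a minimal sketch of those four fio workload patterns, as they might be launched inside each VM against the attached volume. The exact fio options used in the original runs are not given in the talk, so the device path, ioengine, direct-IO flag and runtime here are illustrative assumptions.

```python
# A minimal sketch of the four fio workload patterns described above
# (sequential read/write at 64 KB, random read/write at 4 KB), as they
# might be launched inside each VM against the attached RBD volume.
# The device path, ioengine, direct flag and runtime are assumptions;
# the talk does not give the exact fio options used.
DEVICE = "/dev/vdb"      # hypothetical: the attached Cinder/RBD volume inside the VM
RUNTIME_SEC = 300        # assumed measurement window

PATTERNS = {
    "seq-read":   {"rw": "read",      "bs": "64k"},
    "seq-write":  {"rw": "write",     "bs": "64k"},
    "rand-read":  {"rw": "randread",  "bs": "4k"},
    "rand-write": {"rw": "randwrite", "bs": "4k"},
}

def fio_cmd(name, opts):
    """Assemble an fio command line for one workload pattern.
    Note: the write patterns are destructive on the target device."""
    return [
        "fio", "--name=" + name,
        "--filename=" + DEVICE,
        "--rw=" + opts["rw"],
        "--bs=" + opts["bs"],
        "--ioengine=libaio", "--direct=1",
        "--time_based", "--runtime=" + str(RUNTIME_SEC),
        "--group_reporting",
    ]

for name, opts in PATTERNS.items():
    print(" ".join(fio_cmd(name, opts)))
```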
One thing we care about, beyond just the aggregate throughput of the whole cluster, is making sure each volume gets enough quality of service, so we defined some QoS requirements up front. For random IO we think latency is the number one metric, because we always want IOs to return quickly, so one QoS requirement is that all random IO must have an average latency of less than 20 milliseconds.

The other input was a test we ran on AWS to understand what kind of performance EBS actually provides. We ran random tests for seven days: during those seven days we kept starting VMs, attaching different volumes from time to time, running each VM for two hours and collecting the performance data, to understand what a standard EBS volume gives you. If you look at what AWS publishes, they say a standard EBS volume provides roughly 100 IOPS; they don't mention latency, and they don't mention sequential bandwidth either. Based on our testing, their delivered performance is pretty well in line with that SLA claim. So we set our goals accordingly: we want each volume to provide 100 IOPS with latency below 20 milliseconds for random IO, and on the sequential side we want each volume to provide about 60 MB/s of bandwidth. Those are our QoS targets.

How do we enforce that? Currently Ceph, as far as I know, does not have a good isolation design — a noisy user with a lot of pressure can affect others. If you use OpenStack, what you would usually do is use cgroups to control how much IO and bandwidth each VM can consume. Here we didn't use cgroups; we just used a feature of fio that lets you cap the bandwidth and the IOPS it generates. So for random IO we set 100 IOPS as the per-VM cap, and 60 MB/s as the cap for sequential IO. Then we increase the number of VMs gradually; we can predict that with enough VMs the average per-VM throughput will go down, so the QoS rule we set is that the average per-VM performance should stay above 90% of the cap — that means 54 MB/s for sequential IO and 90 IOPS for random IO. Those are the two QoS criteria for this testing.

Now the fun part. This is the random read result. The X axis is the VM count, which is also the volume count — we create one VM and attach one volume to it. The left side is the per-VM performance, i.e. how many IOPS each volume gets, and the right side is the aggregate performance. The maximum number we measured is about 4,600 IOPS, at 80 VMs. But remember we have the QoS requirement that per-VM performance must stay above 90% of the predefined target; if we apply that, then by the time we scale to 30 VMs the per-VM IOPS has already dropped to about 95. So that's the random read data. [Audience question] Yes, it keeps going down, but because we set a QoS limit we didn't push the test further; I will show more data on latency later that also explains why we picked 30 VMs. [Audience question] The main reason is that if we want to offer an EBS-like service to customers, we want to make sure the cluster is not over-committed, so that we can meet the SLA. [Audience question] Yes — this is 40 spindles, and we use a replica count of 2. I will summarize at the end where the bottleneck is.
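As a small illustration of that 90%-of-cap rule, here is a sketch of how the "peak" VM count can be derived from per-VM measurements. The thresholds are the ones from the talk; the measured values in the example are placeholders, not the actual results.

```python
# QoS bookkeeping as described above: each VM's fio job is capped
# (100 IOPS random, 60 MB/s sequential), and the cluster "peak" is the
# largest VM count at which average per-VM throughput stays above 90%
# of the cap. Thresholds come from the talk; measurements below are
# placeholders for illustration only.
RANDOM_CAP_IOPS = 100
SEQ_CAP_MBPS = 60

RANDOM_QOS_IOPS = 0.9 * RANDOM_CAP_IOPS   # 90 IOPS per VM
SEQ_QOS_MBPS = 0.9 * SEQ_CAP_MBPS         # 54 MB/s per VM
RANDOM_QOS_LATENCY_MS = 20                # average latency target for random IO

def peak_vm_count(per_vm_throughput_by_scale, qos_threshold):
    """Return the largest VM count whose average per-VM throughput
    still meets the QoS threshold."""
    ok = [n for n, t in sorted(per_vm_throughput_by_scale.items())
          if t >= qos_threshold]
    return max(ok) if ok else 0

# Placeholder data shaped like the random-read scaling curve:
measured = {10: 100, 20: 100, 30: 95, 40: 70, 80: 57}
print(peak_vm_count(measured, RANDOM_QOS_IOPS))   # -> 30
```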
This is random write. The curve is a little different from random read. At the beginning the performance looks very good because the latency is very small; that's because most of the writes hit the SSD journal. With the XFS filestore, writes land in the SSD journal first and Ceph then acknowledges the client, so the client sees the write as complete. But as more and more pressure builds up, the data also has to be flushed from the journal to the filestore, i.e. to the real disks, and you can see there is a big jump: moving from 30 VMs to 40 VMs the latency increases very quickly and the per-VM performance also drops very quickly. So based on the same QoS, we again pick 30 VMs as the peak performance point for the cluster.

This is sequential read. We cap every VM at 60 MB/s. As you increase the VM count the aggregate still grows, but again we have the predefined QoS: when the VM count goes to 50, the per-VM throughput drops to about 49 MB/s, which already breaks our QoS threshold, so we pick 40 VMs as the peak.

This is sequential write. We didn't see the SSD journal benefit on sequential write; we think that is because sequential writes use up the journal space very quickly. Also, if you understand Ceph, every read results in a single physical read, but with a replica count of 2 every write results in two physical writes, so simply comparing read and write, a write consumes twice the disk IOPS and bandwidth of a read. In this case the aggregate peak happens at around 50 VMs and drops after that, but applying our QoS rule we take roughly 20 VMs as the peak.

Now the interesting part: latency. We have six lines here, and one thing we found very interesting is that latency has a strong dependence on the queue depth — iodepth is an fio parameter that controls how many IOs are in flight before completion (there is a back-of-the-envelope sketch of this relationship below). One thing we found is that there may still be a bug, or at least room for improvement, on the client side: the RBD client sometimes has only one thread, so if you configure a very long queue depth, most of the IOs end up queuing on the client side. Let me show you what the latency looks like. There are several lines — let's look at read first. The blue line is random read and the red one is random write — sorry, I'm not sure you can tell which is which from here — this one is random read and this one is random write. You can see the latency jumps very quickly, partly because of the SSD journal impact as well.
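Since the talk points out that measured latency depends strongly on fio's iodepth, here is a back-of-the-envelope illustration of why, using Little's law together with the 100 IOPS per-volume cap. This is my own illustration, not a calculation presented in the talk.

```python
# Little's law: outstanding IOs = IOPS * latency, so with a fixed per-volume
# IOPS cap the average latency grows roughly linearly with iodepth.
def expected_latency_ms(iodepth, iops):
    """Approximate average latency (ms) implied by Little's law."""
    return iodepth / iops * 1000.0

PER_VOLUME_IOPS_CAP = 100   # the random-IO cap used in the tests

for qd in (1, 2, 4, 8, 16, 32):
    lat = expected_latency_ms(qd, PER_VOLUME_IOPS_CAP)
    print(f"iodepth={qd:2d} -> ~{lat:5.0f} ms average latency")
# With a 20 ms latency QoS and a 100 IOPS cap, anything beyond iodepth=2
# cannot meet the target no matter how fast the backend is.
```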
In general, the random-IO latency is okay: at the beginning it starts from something like 10 milliseconds, and as you add more pressure it gradually increases. We did a latency breakdown to understand where the latency goes. For the random read operations the latency is pretty good; we observed that most of the latency is spent on the disk. We measure the latency on the fio side — that is, inside the VM, as reported by fio — and we also measure the latency on the storage node side. Usually Ceph does a pretty good job there: it only adds a couple of milliseconds of extra latency, which I think is pretty good.

The not-so-good part is sequential IO: the sequential latency is larger than you would expect. When you do sequential IO on a local disk you see something like one or two milliseconds at most, because there is almost no spindle seeking; but here something different happens. To understand why, we ran blktrace and looked at the IO pattern. What Ceph does is distribute the IO across different objects over the whole cluster, so in theory a logical sequential read or write on a virtual disk ends up as random IO across the physical nodes. There are two figures here. The first is 40 VMs all doing sequential IO: the red part is IOs that land one right after another, i.e. truly sequential at the physical level, but the blue part shows that even when everything issued is sequential, maybe 20% still becomes random. If you mix random and sequential together — the second figure, 40 VMs in total running a mix of random and sequential workloads — the blue (random) part becomes larger. That's why we believe that when many virtual streams are mixed together and hit one physical disk, the on-disk pattern becomes very random, the disk has to seek a lot, and that makes the latency increase a lot.

Here's an interesting one: besides the SATA disks we also tried all-SSD, and in general I think Ceph is pretty good on SSD. This result is from a single all-SSD node: if we set the latency QoS at one millisecond we get something like 55K IOPS from one node, and if you relax the latency a little you can get almost 80K IOPS at two milliseconds. That's pretty good, I think.

Okay, now the fun part: we tried to understand whether the Ceph cluster is efficient or not. We summarize it in a table with a line for each of the four workload patterns. The first column is the maximum throughput we measured without considering the QoS requirement, the next is the throughput when we do consider QoS, and then there is what the disks could provide in theory in terms of bandwidth and IOPS. To get that theoretical number we measured the raw disks.
The disk model we use is a Seagate enterprise SATA drive; in general it provides something like 90 IOPS per disk for random IO and about 160 MB/s for sequential. Based on that we calculate the theoretical disk throughput for the whole cluster — remember we have 40 disks — and for writes we also account for replication: every write must happen twice, so we halve the maximum throughput, which is why the write ceiling is only half. We also consider the network limit: in the current setup each node has one 10 GbE link, so for small IO the network is not a problem, but for large IO we assume we can get at most about 4,000 MB/s in aggregate. We take the smaller of these two numbers as the theoretical system capability and calculate efficiency against it (there is a small sketch of this calculation below). I think Ceph is very good at random IO — personally I think it's pretty good — and there is still something we can do on the sequential side. I talked with the Inktank people these days; we did some testing and were already able to improve sequential performance by something like 50%, and they gave me some further hints, so we hope we can work together to make this better in the future.
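Here is a sketch of that efficiency calculation as I understood it: per-disk capability scaled to 40 spindles, the write ceiling halved because a replica count of 2 turns one logical write into two physical writes, and sequential throughput capped by the network budget. The per-disk and network figures are from the talk; the example measured value is a placeholder.

```python
# Theoretical ceilings for the 4-node / 40-spindle cluster described above.
DISKS = 40
DISK_RAND_IOPS = 90          # per-disk random IOPS (from the talk)
DISK_SEQ_MBPS = 160          # per-disk sequential MB/s (from the talk)
NETWORK_CAP_MBPS = 4000      # aggregate network budget assumed in the talk
REPLICAS = 2                 # replica count used in the tests

ceilings = {
    "rand-read":  DISKS * DISK_RAND_IOPS,                               # IOPS
    "rand-write": DISKS * DISK_RAND_IOPS // REPLICAS,                   # IOPS
    "seq-read":   min(DISKS * DISK_SEQ_MBPS, NETWORK_CAP_MBPS),         # MB/s
    "seq-write":  min(DISKS * DISK_SEQ_MBPS // REPLICAS, NETWORK_CAP_MBPS),
}

def efficiency(measured, ceiling):
    """Measured throughput as a fraction of the theoretical ceiling."""
    return measured / ceiling

print(ceilings)
# Placeholder example: 1500 measured random-write IOPS vs the 1800 IOPS ceiling
print(f"{efficiency(1500, ceilings['rand-write']):.0%}")
```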
On the other side, let's compare SATA and SSD. In general, with a traditional SATA disk you get a lot of capacity but the IOPS is pretty low; with SSD, capacity is the issue but the performance is very good. So we think mixing SSD and SATA together may be the better solution.

As a summary for Ceph: random is pretty good; sequential we are still working on. As a next step we will keep going: currently we only work with fio, so we will move to more realistic workload studies, starting from simple ones — for example sysbench — and gradually moving to more complex enterprise workloads. The reason is that we want to understand how the latency really impacts application performance. The second thing is to understand whether we can use the SSDs better: in general, how to balance the SSD between the journal and also using the SSD as part of the data path, I mean as a cache on the filestore side.

[Audience question] Sorry, I didn't catch that... I think the question is whether placing data in an unbalanced way would impact performance. I think Ceph does a pretty good job there: the CRUSH algorithm distributes the data very well. One special piece of tuning we did recently: by default, if you create a pool and then create volumes in it — in this case one pool containing all 40 disks — all your data gets distributed over those 40 disks. Sometimes that's not a very good design, for two reasons. First, if you create 40 volumes in the same pool, the traffic from all those different volumes will very likely land together, so the IO becomes more fragmented and random. Second, if you put all the disks in one pool and one disk fails, it impacts a lot of volumes. So we did some tuning there: you can create more pools, arranged matrix-style. For example, in this case we have four storage nodes, so we pick one disk from each node to form a pool, which gives us ten pools. That increases sequential performance by something like 10% and also reduces the impact of a disk failure on the whole cluster. But for your question: yes, it's pretty good, the data is almost evenly distributed, and yes, that's the default behaviour.

Okay, that's Ceph; let's move to Swift. I know some of you want to see a Swift-to-Ceph comparison... Anyway, we have a separate cluster for this. It's a very common setup: we have ten storage nodes and two proxy nodes, and in front of the proxy nodes we have a load balancer doing the load balancing. Each storage node has one processor, 16 GB of memory and one quad-port NIC. Recently Swift gained a very nice design where you can bind several IPs to one storage node; originally we used link bonding, but the bonding performance was not very good, and this is better. Each storage node has twelve SATA disks plus one SSD; the SSD is used to hold the account and container databases. This is the software configuration — we use the latest Swift code; I will skip most of the details here and upload the slides so you can check them later.

For the workload we use COSBench, a tool developed by Intel that we have already open-sourced; we introduced COSBench one year ago, also at an OpenStack conference. Currently COSBench supports Swift out of the box, it also supports the S3 interface, and it supports Amplidata — I'm not sure how many of you know Amplidata, but more and more people are trying it; if you are interested you can go to the website.

We did two tests: the first we call the small-scale test, and the second the large-scale test. The small-scale test is pretty small — we only create 100 containers, each with 100 objects — just to understand the best performance we can get when there is very little data. For the large-scale test we create a lot more, with two different object sizes: the small object is 128 KB and the large one is 10 MB. For the small objects we create a lot — 10,000 containers with 10,000 objects each. For the large objects we don't have that much disk space, so we create 10,000 containers with 100 objects each. The reason we create relatively more containers is that some people told me that having more containers puts more pressure on the container service, so we keep the container count high relative to the objects per container. We ramp up for 300 seconds and measure for 300 seconds. We also define a QoS, because we think latency is very important: in general we want the time to first byte to be less than 200 milliseconds, and the latency QoS is 200 milliseconds plus the object size divided by 2 MB/s — so a big object gets a correspondingly longer budget for transferring the data.
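As a small illustration, here is that latency QoS formula written out — a 200 ms base budget plus a transfer allowance of the object size divided by 2 MB/s. The function name and the exact rounding are my own.

```python
MB = 1024 * 1024

def swift_latency_qos_ms(object_size_bytes):
    """Latency budget = 200 ms + object_size / (2 MB/s), as described above."""
    transfer_ms = object_size_bytes / (2 * MB) * 1000.0
    return 200.0 + transfer_ms

print(swift_latency_qos_ms(128 * 1024))   # 128 KB small object -> ~262 ms budget
print(swift_latency_qos_ms(10 * MB))      # 10 MB large object  -> ~5200 ms budget
```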
So this is the small-scale result, and it's pretty good: all the latencies are fine, and we can drive the CPU to almost full utilization, especially for small objects — the bottleneck for the small-object test is the CPU. For the large objects the bottleneck is the network, because we only have two 10 GbE links.

Now let's look at what happens when we increase the scale. The large-object numbers stay almost the same, because the network is the bottleneck and we didn't change the network. But the small-object numbers drop a lot, and we believe most of the issue comes from the disks — I will show more about this. [Audience question] Yes, roughly 80% and 58% performance degradation for the small-object testing. I think a lot of the community already knows this; we talked about it at the last design summit. If you compare the large-scale and small-scale runs, the IO pattern changes a lot: I'm showing latency over time, and on the left the typical IO sizes for read and write — they change a lot. We also used blktrace to capture what is happening. In general, there is a lot of filesystem metadata overhead: there are so many inodes, and all that metadata cannot be cached in memory because the memory is not large enough, so Swift has to wait for those inode and metadata reads.

One thing you can do is simply have more memory. This is a test we did: the green line is the small-scale run, i.e. the ideal target; the blue line is the large-scale run with enough memory; and the red one is with some preloading. You can see that if you have enough physical memory, then as time goes on most of the inode and metadata information gets cached and the performance gets very close to the ideal. And if you don't want to wait for the cache to warm up, you can preload: set vm.vfs_cache_pressure to 1 and then walk the filesystem (for example with a recursive listing) so that all the inode information gets cached. If you do that prefetch and preload, the performance is good enough.
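Here is a minimal sketch of that preload idea: lower vm.vfs_cache_pressure so the kernel prefers to keep dentry/inode caches, then walk the object tree once so every inode is read into memory. The mount path and the use of a stat() walk are assumptions for illustration; the talk doesn't specify the exact commands used.

```python
import os

OBJECT_ROOT = "/srv/node"   # hypothetical Swift device mount root

def prefer_keeping_inode_cache():
    # vm.vfs_cache_pressure = 1 tells the kernel to strongly prefer keeping
    # dentry/inode caches over reclaiming them under memory pressure.
    with open("/proc/sys/vm/vfs_cache_pressure", "w") as f:
        f.write("1")

def warm_inode_cache(root):
    """Stat every file under root so its inode is pulled into the cache."""
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                touched += 1
            except OSError:
                pass
    return touched

if __name__ == "__main__":
    prefer_keeping_inode_cache()   # needs root
    print("inodes touched:", warm_inode_cache(OBJECT_ROOT))
```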
But that is not a great answer, because memory is expensive, so we tried to figure out something else. The second thing we tried is using an SSD with flashcache. I'm not sure how many of you have heard of flashcache — it's from Facebook, and it uses an SSD as a block cache. We configured flashcache so that the SSD only caches the inode and metadata information, and that improves things a lot — something like a 50% to 100% performance improvement compared to having no SSD — but there is still a big gap compared to the perfect case. So I think there is more we can do.

In general: for Swift, big objects are okay, and small objects need some tuning. The first option: in the latest XFS you can use a smaller inode size — the default inode size we used is 1 KB, and it has been suggested we could go to something smaller, like 256 bytes — but you can only do that with the latest XFS; on older kernels you cannot. We are trying to test this; it's still work in progress. Second, people tell us we could use a different filesystem, but I don't think our customers would like that. There is also some related discussion in the Ceph community about how to handle small objects — things like "let's use LevelDB", "let's use something different and not go through the filesystem at all". And the next approach is like Haystack or TFS: they add an extra layer that combines small objects into one big file, so you can reduce the IO to the physical disk (there is a toy sketch of this idea at the end). Swift today actually has a blueprint proposed by Red Hat called LFS; maybe we can do something in that area to group all these small objects into a big one and reduce the disk operations.

For ourselves, we will continue doing more testing to understand how best to use the SSD, because that is the simplest approach: we have only tried write-through so far, we will try write-back, and we want to understand where the latency goes. Some of our customers also have a concern: if you use flashcache, what happens when the SSD dies? So we will do more testing there. That's the next step.

Okay, that's all. As a summary, I'm personally a big fan of both Swift and Ceph — I really like both pieces of software. Swift is very simple and easy to use; Ceph has a more sophisticated architecture. I think their performance is good, but there are still a lot of things to do, so if you want to work on this together, we hope we can work together to make it better. That's all — thank you.
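As a toy illustration of that Haystack/TFS-style idea — packing many small objects into one large file with an in-memory index, so each small read costs a seek into one big file rather than a separate inode and file per object — here is a minimal sketch. This is my own illustration of the concept, not the Swift LFS blueprint or any existing implementation.

```python
class PackedStore:
    """Append-only pack file plus an in-memory index of (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                # object name -> (offset, length)
        open(path, "ab").close()       # make sure the pack file exists

    def put(self, name, data: bytes):
        # Append the object to the pack file and remember where it landed.
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name) -> bytes:
        # One seek into the big file instead of opening a per-object file.
        offset, length = self.index[name]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

store = PackedStore("/tmp/objects.pack")
store.put("container/obj1", b"hello")
store.put("container/obj2", b"world")
print(store.get("container/obj2"))     # b'world'
```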