All right, thank you everybody for joining us for the next presentation in our Ceph track at Open Source Days. Just as a reminder before we get started: if you do have any questions, these are being recorded for posterity, so please use the microphones so that everyone can hear you, as well as the camera. Our next presenter is John from Intel. He's part of the Intel China team that's been working with Ceph for a long time, and he's going to talk about some of the performance work they've been doing, specifically on some of the new hardware. So, John.

Yeah, thank you everyone. This is Jian, from Shanghai, China. I'm currently leading a team doing Ceph optimization on Intel architecture, and we also open source a lot of our tools, like COSBench, CeTune, and VSM. There was another speaker from our SSG department, but he had to leave because of an emergency, so I will talk about the 3D XPoint material on his behalf; if there are some deep questions I cannot answer, we'll check with him afterward.

This is the brief agenda. First I will talk about Ceph in China. Then I will go through the all-flash configurations we proposed; we have three kinds of configurations. Next, Optane launched just this April, and the TLC 3D NAND SSDs launched just last week, so I'm going to show the 2.8 million IOPS we get on the reference architecture, with a demonstration of how Optane improves your Ceph cluster's performance. After that comes some scalability analysis: what the current problems are and how we can push the performance even higher. The last part will show how this reference architecture is going to evolve with the Optane and TLC 3D NAND technologies.

Ceph is pretty hot in China. We work with the community closely, and we have held three Ceph Days in China, which attracted about a thousand attendees from around 500 companies of different types. What I observe is that more and more companies have started to do real development based on Ceph in their products. Some of those are OEMs and traditional enterprise companies, and we do see a lot of adoption among the cloud service providers. It's hot enough that some media just jump in and do Ceph coverage. You have probably noticed that there are a lot of contributors from China in the code base now; compared with back in 2011, the number of contributors keeps growing.

So who is using Ceph? This is based on our customer engagement work and on public examples. The first group is telecom, because as their systems scale and data grows, traditional enterprise storage cannot hold their data and the performance cannot satisfy the workload. As one customer put it, the storage cost is pretty high, like 30 to 50% of their IT cost, which limits the scalability of the traditional setup and makes those systems very expensive and difficult to operate. There are successful examples like China Unicom, one of the biggest telecoms in China. The second group is the cloud service providers.
We call them BAT, which is an interesting word for Baidu, Alibaba, and Tencent, the three biggest ones in China; but the rest of them are moving away from traditional SAN solutions to open-source scale-out solutions, also out of cost considerations. The third group is internet companies; successful examples include LeTV, which is a YouTube-like service, Ctrip, a travel agency, and PR Cloud, a small cloud service provider. We also have a lot of OEMs and ODMs here trying to build Ceph-based storage solutions, hardware and software together; a successful example is QCT, with whom we collaborate a lot in this area. The last group is the traditional enterprises and research institutes. The research institutes mainly focus on the Ceph file system and things like the key-value store backend, while the enterprises tend to deploy the open-source Ceph cluster by themselves and then purchase some kind of third-party support.

So why are we doing Ceph on all-flash arrays? The first reason is that the service providers are trying to provide a high-performance storage backend for their private or public cloud; they want to offer EBS-like services. The second reason is that there is now strong demand for running enterprise workloads on the Ceph cluster. What I observe is that a lot of customers really want to run SQL-type workloads: MySQL, Oracle, and so on. In this scenario performance is really important, and not only throughput but also latency. I don't mean the average latency, I mean the tail latency, because for those SQL workloads latency is critical: if your latency is too high the transaction may abort, so your workload simply cannot run. The second thing is that we can build a multi-purpose Ceph cluster. What's the advantage of running MySQL on Ceph compared with the traditional way? You can scale out as you want without operating a new cluster, and that's the benefit you want most. So performance is still important, and that's why we do this work on the newly launched Optane cluster. The last point, and probably the most important one: the SSD price keeps dropping, and with TLC 3D NAND SSDs we believe that at some point in the near future the SSD price will be much closer to that of high-end HDD drives.

So let's take a look at the three configurations we proposed. We simply call them good, better, best. In the good one you have a SATA or NVMe PCIe SSD as your journal, or as the WAL and RocksDB device for BlueStore, and you have hard drives as the data drives. Your processor doesn't need to be very powerful; a traditional E3 processor, a pretty old one, works. You do not need large memory or a high-speed network in this scenario. This type of configuration targets the scenario where you need high capacity and don't care too much about performance, so you can simply estimate the throughput you will get by multiplying the per-drive throughput by the number of drives.
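As a worked example of that estimate, here is a back-of-envelope calculation. All the numbers are illustrative assumptions, not figures from the talk:

```python
# Rough throughput estimate for the HDD-backed "good" configuration.
hdd_mbps = 150        # assumed sequential MB/s per 7200 rpm HDD
hdds_per_node = 12    # assumed data drives per node
nodes = 4             # assumed cluster size
replication = 3       # 3x replication divides usable write bandwidth

raw_write = hdd_mbps * hdds_per_node * nodes   # aggregate raw MB/s
usable_write = raw_write / replication         # client-visible MB/s
print(f"~{usable_write:.0f} MB/s usable sequential write")
```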
The second and third configurations we proposed are both all-flash. The second one uses SATA SSDs, like the S3510 here, as the data drives; the use scenario is where you have high performance requirements for throughput, IOPS, or latency. The last one is an all-NVMe configuration: a top-bin processor, the E5-2699 v4, 128 gigabytes of memory, an Optane P4800X as the journal or WAL drive, and P4500 SSDs as the data drives. We have actually published some of the Ceph tunings on ceph.com, so you can take a look at how we tuned the all-flash configurations.

So let's look at the latest Optane performance with BlueStore. Before I go into the details, I'd like to show a picture of the promising results since day one of our performance work. The first release we tested ran on a Sandy Bridge UP processor (UP means uniprocessor, a single socket), a Xeon E3-1200 series part, which is pretty slow, with a hard drive cluster. Then at 0.86 we went to the first all-flash configuration with the S3700; the throughput at that time was still pretty low, and this chart shows normalized per-node throughput. Then came the run with jemalloc, a no-code-change optimization to the Ceph cluster, and on the all-flash array you get a 3.7x performance improvement. Then we moved to the next release, 0.94, and after that to the NVMe configuration with the P3700 as the NVMe drive, which gave about a 1.66x performance improvement. The biggest jump, 1.98x, is actually from FileStore to BlueStore. This is purely 4K random write, not read; I'm only demonstrating write performance here. It's also interesting that on the Broadwell cluster, the E5-2699 v4, the performance on 11.0.2 was pretty high, but you may notice that on 12.0 it dropped slightly, because some of the tunings no longer worked, so we had to retune to bring it back. At that point we switched to the Optane cluster: we replaced the P3700 with the P4800X, and you can see about a 20% performance improvement. That is much lower than we expected, because the P4800X is something like 20 times faster than the P3700.

So let's look at the details. This is the reference architecture we proposed. Each node was configured with one top-bin E5-2699 v4 processor and a 40 GbE network. We have eight OSD nodes, and each node was configured with one Optane drive and eight of the P4500 TLC 3D NAND SSD drives. We actually ran multiple configurations, but the best performance comes from two OSD instances per drive. We were running the latest stable Ceph version at that time, on Ubuntu 14.04. So let's take a look at the performance first.
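The talk doesn't show the exact benchmark job, but numbers like these are typically collected with fio's librbd engine running on the client nodes (the Q&A at the end confirms fio with librbd was used). A minimal sketch; the pool name, image name, and queue depth are assumptions:

```python
#!/usr/bin/env python3
"""Drive a 4K random-read benchmark against an RBD image via fio's rbd engine."""
import subprocess

cmd = [
    "fio",
    "--name=4k-randread",
    "--ioengine=rbd",         # talk to librbd directly, no kernel mount
    "--clientname=admin",     # cephx user (assumed)
    "--pool=rbd",             # assumed pool
    "--rbdname=bench-img-0",  # assumed pre-created image
    "--rw=randread",
    "--bs=4k",
    "--iodepth=32",           # assumed queue depth
    "--time_based",
    "--runtime=300",
    "--group_reporting",
]
subprocess.check_call(cmd)
```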
You probably noticed these numbers during the earlier demos. This is what we can get today. For 4K random read it's 2.8 million IOPS; the average latency is pretty small, like 0.9 milliseconds, and the best part is the tail latency, I mean the four-nines latency: it's only 2.25 milliseconds. For 4K random write it's 600K IOPS, four milliseconds average latency, and 25 milliseconds four-nines latency. Let me give some more context on the tail latency: if you are using a P3700 here, the tail latency will be more like 500 milliseconds, so Optane can significantly reduce the tail latency in this scenario. For the 64K sequential workloads the performance is good, and I don't think we actually hit the hardware limitation; maybe there is still something to tune, like the frame size. But this is the highest performance we can get today.

Behind this peak throughput are all kinds of tunings and evaluations, starting with NUMA. Usually you do not pay attention to NUMA in a workload like Ceph; you do not bind the OSDs to a NUMA node or to specific cores. But we observed that if you bind the OSDs to a specific NUMA node, it can improve your performance by about 20%, especially for the 4K random workloads.

Then we did hyper-threading tuning. We looked at this because we observed that with four drives per node the performance is almost the same as with two drives per node; the CPU is fully utilized. I have detailed data on that part. So does HT really help in this scenario? It doesn't really help for the 4K random read and random write workloads.

The third thing is jemalloc versus tcmalloc with BlueStore: with the new release of tcmalloc, it doesn't really matter whether you use tcmalloc or jemalloc.

The most interesting part is the drive scaling. First we compare an eight-SSD-per-node configuration with a four-SSD-per-node configuration, and it doesn't actually help a lot: you would probably expect something like a two-times performance improvement, but here you may see only 20%. That's simply because for the random workloads you hit the CPU limitation. Second, our previous best tuning practice for the NVMe configurations was to create something like four OSD instances per drive, but with more SSDs that changes. For example, with four drives per node, four OSDs per drive may be a good configuration and gets the highest performance; but if you have eight drives, you may need to drop to two OSDs per drive for the highest throughput. We also tried three, and it's lower than the two-OSD-per-drive configuration. So the conclusion is that the node scalability looks good, much better than the drive scalability; the drive scalability, I mean for the 4K small-block workloads, is really bad. We need to optimize the CPU part, and for all-flash arrays we should pay attention to the NUMA tuning; it can help a lot with your performance.
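Here is the kind of NUMA pinning described above, as a hedged sketch. Ceph itself did not bind OSDs to NUMA nodes at the time, so this is an external helper that must run as root; the core ranges and the OSD-to-node policy are assumptions for a dual-socket box:

```python
#!/usr/bin/env python3
"""Pin each running ceph-osd process to the cores of one NUMA node.

Hypothetical topology: node 0 owns cores 0-21, node 1 owns cores 22-43.
In practice you would read the real topology (lscpu/libnuma) and pin each
OSD to the node its NVMe drive is attached to.
"""
import os
import re
import subprocess

NODE_CORES = {0: range(0, 22), 1: range(22, 44)}  # assumed core layout

def osd_pids():
    """Return {osd_id: pid} for running ceph-osd daemons."""
    out = subprocess.check_output(["ps", "-eo", "pid,args"], text=True)
    pids = {}
    for line in out.splitlines():
        m = re.search(r"^\s*(\d+)\s+\S*ceph-osd\s.*--id[= ](\d+)", line)
        if m:
            pids[int(m.group(2))] = int(m.group(1))
    return pids

for osd_id, pid in osd_pids().items():
    node = osd_id % 2  # toy policy: spread OSDs across the two sockets
    os.sched_setaffinity(pid, NODE_CORES[node])
    print(f"osd.{osd_id} (pid {pid}) -> NUMA node {node}")
```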
Now let's see where Optane can help your performance. In the first part we are running BlueStore using the Optane drive as the WAL, co-located with the RocksDB database. The red line is with the P3700 as the DB drive: you can see the performance fluctuates a lot, but with Optane it's very stable. You may also see a decrease over time; we believe that is caused by the SSD's garbage collection, because the SSD is not yet in its steady state. At first, when you write to the SSD, it's really fast because no garbage collection is involved, but at some point that process kicks in. We are now working to figure out the root cause of the decrease. From the drive's perspective, you can see the drive latency: the Optane is really stable, always below 0.1 milliseconds, but the P3700 fluctuates, and when RocksDB starts to do compaction the P3700 SSD latency increases a lot, up to 30 or 70 milliseconds. So in that case, even the fastest of the previous NVMe SSDs becomes the performance bottleneck.

The second part gives more detail on the RocksDB side; we looked at how Optane helps eliminate the RocksDB bottleneck. Previously, with the P3700, what you observe with BlueStore is that at some point the write throughput may drop to zero. That's because RocksDB is doing a write stall. RocksDB has a mechanism: when flushing the data cannot keep up with the rate of the incoming data, it stalls writes. This has nothing to do with compaction. The second row is about compaction: you see compaction happening all the time, but you do not see write stalls all the time. When the write stall triggers, the submit latency increases a lot, so your performance suffers; it may be extremely low. With Optane we completely eliminated the write stalls in this scenario, because its write speed is something like 20 times that of the P3700.

The last part is on the latency side. As our colleagues said in the previous session, the tail latency is really important, and Optane helps a lot here. We show three percentiles, 95%, 99%, and 99.99%, and the four-nines latency is reduced by about 20 times with Optane, which is a huge drop.

OK, with all the performance covered, let's take a look at the analysis. The first part is what we talked about, the CPU overhead, with 4K random read and 4K random write. Take a look at the right picture: the CPU utilization is like 95% on average, and this is happening on the first 44 cores. The rest, from cores 45 to 88, are not that busy; those are the hyper-threading cores. HT has some scalability issues; it's designed more for sequential workloads than for random workloads, so you cannot expect a big performance improvement from the hyper-threading cores if you are running a random, CPU-intensive workload. The implication is that you can probably only run four drives on a top-bin E5-2699 v4 node; adding more drives to that node doesn't help much, and the same goes for 4K random write. The high CPU utilization is really limiting the per-node drive scalability; it doesn't scale up.
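One aside on diagnosing the write stalls discussed above: RocksDB announces stalls in its info log, and with BlueStore that output is routed into the OSD's own log file when the rocksdb debug level is raised. A rough sketch; the log path and message wording are assumptions that vary across Ceph and RocksDB versions:

```python
#!/usr/bin/env python3
"""Count RocksDB stall/stop events in a ceph-osd log (assumed path/format)."""
import re

LOG = "/var/log/ceph/ceph-osd.0.log"  # hypothetical OSD id 0

stalls = stops = 0
with open(LOG) as f:
    for line in f:
        if re.search(r"Stalling writes", line):    # soft slowdown
            stalls += 1
        elif re.search(r"Stopping writes", line):  # hard stop, throughput -> 0
            stops += 1
print(f"stall events: {stalls}, hard stops: {stops}")
```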
That's one thing we need to pay attention to when building an all-flash array cluster. So we took a perf record to see who consumes most of the CPU cycles. This is for 4K random read; sorry, it might be a bit small. We ran perf for 30 seconds against one of the ceph-osd processes, and you can see the biggest part is actually the async messenger, the network messenger layer; its overhead is really heavy. The next picture gives you even more of a clue. We break it down: you have a lot of socket connections, a lot of epoll overhead, and a lot of TCP send-message and receive-message overhead. That's the messenger thread part. Then look at the OSD worker thread pool, the tp_osd_tp part: the first item is the system call overhead, and the second is the op work-queue processing, where the primary PG's do_op runs. We were actually expecting most of the overhead to come from do_op, but from this picture you can see that we really need some optimization in the network messenger layer to reduce the CPU usage, so you can put more SSDs in a single node. For write it's a similar story: the WAL remove thread is about 1.8% overhead, the async messenger takes about 28.9%, so there is still a lot of overhead here, and RocksDB also takes quite a bit, though relatively less in this scenario. BlueStore itself doesn't really take too much of the overhead; it's something like 32 to 36 percent in total.

That was the CPU profiling part. Then we looked at the WAL tuning, and we did a bunch of experiments here. The default baseline has the NVMe drive holding both the RocksDB database and the WAL device; the throughput is about 340K IOPS with that separate RocksDB database and WAL device. The first tuning: keep the database on the NVMe drive and move the WAL to a RAM disk. We wanted to see how much it improves if you put the WAL on a RAM disk: only about a 20K IOPS improvement. The second: we completely eliminated the RocksDB WAL, and it's almost the same, which means that even with the fastest possible device, memory, as the RocksDB WAL, you cannot improve the performance significantly. So maybe at that point the RocksDB WAL isn't the bottleneck any longer. The third tuning: we completely skip the WAL write path, so we don't write the WAL at all. This gives the highest performance we got, but it's not safe, right? So the last thing we did was try to find a configuration that improves performance while keeping your data safe. In the fourth experiment we do not recycle the WAL, so it is written to constantly and keeps growing; in that scenario the performance is the lowest of all, which means that if you do not recycle the WAL and constantly write the metadata to the RocksDB WAL, eventually your performance will degrade a lot. And here comes what we think is the reasonable best: you have another drive dedicated as the WAL. That gives about 380K IOPS, and it's safe.
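For reference, this kind of device split is what Luminous-era ceph-disk expresses with its --block.db and --block.wal options at OSD-creation time. A hedged sketch; all device paths are illustrative assumptions, not the talk's exact layout:

```python
#!/usr/bin/env python3
"""Create BlueStore OSDs with the DB and WAL on separate fast devices.

Assumed paths: /dev/nvme0n1 is an Optane drive for the WAL, /dev/nvme1n1
holds the RocksDB database, /dev/nvme2n1..9n1 are TLC 3D NAND data drives.
ceph-disk carves partitions out of the db/wal devices when given the raw
device path.
"""
import subprocess

WAL_DEV = "/dev/nvme0n1"
DB_DEV = "/dev/nvme1n1"
DATA = [f"/dev/nvme{i}n1" for i in range(2, 10)]

for dev in DATA:
    subprocess.check_call([
        "ceph-disk", "prepare", "--bluestore",
        "--block.db", DB_DEV,    # RocksDB database
        "--block.wal", WAL_DEV,  # write-ahead log on the Optane drive
        dev,                     # object data on the TLC 3D NAND SSD
    ])
```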
So this is our recommendation. Maybe in the future, if you want a really high-performance configuration, you can use two drives: one holds your RocksDB database, one holds your write-ahead log, and your data goes to the TLC 3D NAND SSDs.

OK, that's our performance analysis. We want to deliver two messages. The first is that the network layer overhead is pretty high; if you are trying to put more SSDs in one single node, that's the part to pay attention to. The second is about the WAL and RocksDB: you cannot expect that with even faster devices the performance will increase much more, because at that point the RocksDB compaction and WAL are no longer the problem.

So let's see what we are going to propose next. There are two Intel technologies here: the first we call 3D XPoint, which is in the Optane SSD, and the second is 3D NAND. If we take the CPU as the center, the closest ring is the DRAM: small capacity, lowest latency. After that is the Optane SSD, where the capacity is much bigger but the latency also increases a bit. On the third ring we have the 3D NAND SSD, where the cost is much lower; for the P4500 the cost is something like 30 cents per gigabyte, while for the Optane drive it's probably much higher. You can expect that in the near future the TLC 3D NAND SSD cost will become extremely low, and people probably won't use hard drives any longer. So we have an extremely fast device based on Optane, and a high-capacity, good-endurance, low-cost SSD with the 3D NAND technology. We are hoping that in the near future, in the all-flash array configuration, we can put an Optane drive and multiple TLC 3D NAND SSDs together to build a high-performance, low-latency, high-capacity, and cost-effective solution. The simple idea is that we can put the journal log and metadata, and even a cache, on the Optane drive, and put your data on the TLC 3D NAND SSDs. And don't worry about the endurance: the P4500's endurance is pretty similar to the P3700's; they are on the same level. In this way we can provide the best IOPS per dollar, the best IOPS per terabyte, and maybe the best terabytes per rack in one configuration.

You probably know that we have another form factor for the 3D XPoint technology, where the device is a kind of persistent memory. Here is some preliminary work we have done with persistent memory devices, trying to see how we can use persistent memory in Ceph. This is PoC work based on a library called libpmem. It comes in flavors for different use scenarios: you have libpmemblk, you have libpmemobj, a whole bunch of libraries that let your user-level application access the persistent memory device directly, bypassing the file system and the kernel.
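libpmem itself is a C library, so here is only a rough Python analogy of the same idea, not the libpmem API: map a file on a DAX-mounted persistent-memory filesystem and store into it directly, so the data path avoids the page cache. The mount point is hypothetical:

```python
import mmap
import os

path = "/mnt/pmem/demo"  # hypothetical file on a DAX-mounted pmem filesystem
size = 4096

fd = os.open(path, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, size)
buf = mmap.mmap(fd, size)

buf[0:5] = b"hello"  # store straight into the mapped region
buf.flush()          # crude stand-in for pmem_persist(); libpmem uses CLWB plus fences
buf.close()
os.close(fd)
```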
For the first try we have submitted a pull request, but it still depends on a lot of testing and evaluation, especially performance assessment once the hardware is available.

So here comes the summary. With Ceph you can have different kinds of configurations: good, better, best. Second, we do see strong demand for all-flash configurations from different kinds of customers, and the Optane-based all-flash Ceph cluster is capable of delivering 2.8 million IOPS with extremely low latency, not only on the average latency but also on the tail latency. But we still need to work a bit to make Ceph more efficient in the all-flash configuration. As the next step, we will try Optane with a client-side cache to improve the efficiency further.

OK, this is actually teamwork; I share the credit with my colleagues Haodong and Jianpeng here. In the backup slides we have attached all the detailed Ceph tuning configurations, including the RocksDB tunings, the BlueStore tunings, and all the debug-level tunings, so you can see every tuning we used in this scenario. OK, thank you. That's all the content. Any questions?

Hi, everyone. That was great work, by the way; thank you very much for doing this. Will this be available online?

Yeah, I think the configurations will be, yeah.

And secondly, just a comment: we saw the exact same thing with PCIe drives; about four of them will max out two 2699s. So, very interesting. Thank you very much.

You're welcome.

I haven't seen the presentation, but for the 2.8 million, what is the size of the cluster?

Oh, the size of the cluster: eight nodes, with 64 drives, eight drives per node.

OK, but the 2.8 million is client I/Os?

Yeah, client I/O, yes, with fio running on the client side, fio with librbd directly.

Hi. You mentioned that you have modified RocksDB to provide real-time compaction. Is that part of the code published on GitHub or anywhere?

It's not really real-time compaction; the RocksDB tuning is all here. You can see it's about the buffer numbers and the trigger thresholds. What we added is a kind of event tracing in RocksDB, so you can actually trace everything RocksDB dumps out.

Oh, I see, you mean this part, all the counters, right? This one?

Yeah, we can send this out.

OK, thank you.

All right, any other questions? No? All right. For whoever asked about the slides: they will be up on the Ceph SlideShare account, where most of our Ceph Day content ends up, so go ahead and take a look there. If you don't find them, ask on the mailing list and someone will point you the way. Other than that, thank you, John. It was great.