All right, we're ready to start the next session. Next we have Nicolas, who will be talking about some of the work he's been doing around NVMe. Hi, good morning. I'm Nicolas Poggi from the Barcelona Supercomputing Center, and today I'm going to talk about and show some benchmark results, both at the system level, meaning the device level, and on the application side. On the application side, the use case is HBase, and the drives we're testing are NVMe. NVMe drives, non-volatile memory express, are basically SSDs on a PCIe bus. So that's what this talk is about. This work is a collaboration between the Barcelona Supercomputing Center and Rackspace US, so it's half academic and half industrial. This is the first part of our research project, so we welcome feedback and contributions on the results you'll see.

So the outline: I'm going to first introduce BSC and the ALOJA project, why we're doing this, and a bit of the motivation. Then we'll get into system-level benchmarks of the NVMe devices, just to set expectations for the maximum performance we can get from the drives. And then we'll get into the core of the presentation, the HBase benchmarks, separated into two parts: a read-only workload, where we get the most benefit, and a mixed workload, where we're actually deleting, updating, and inserting data into the store.

The Barcelona Supercomputing Center is the Spanish national supercomputing facility. We host the MareNostrum supercomputer, and this year, through the European Commission and the Spanish government, we're getting the new version, MareNostrum 4; if you're interested in supercomputing and HPC, follow the news. We're based at the Technical University of Catalonia in Barcelona, so most of the people in the center are professors, students, or PhDs, and we have a very academic track record, but we also have partnerships with industry players.

The reason I'm presenting these results is that we've been working in the Hadoop big data ecosystem since around 2008, first on schedulers and making sure that concurrent jobs finish on time, and for the past few years we have been embarked on a benchmarking project for Hadoop ecosystem applications, specifically on cloud and HPC architectures. This is ALOJA, an open-data and open-source project, and the idea is to automate the characterization of new hardware deployments and to optimize software configuration. Over these three years we have built a benchmarking platform that you can download and play with. Basically, it does the provisioning of the clusters, either on-premise or in the cloud, it sets up the application you want to benchmark, in this case HBase and Hadoop, and it runs the different tests, changing the configuration between runs. After we run the benchmarks, we collect the results in an online repository, which is available online so you can browse the benchmark results. On top of that we're doing some analytics: performance-metric analytics, higher-level learning on the execution metadata, and prediction models, to try to find out, and automate, how to improve the performance of the systems. We collaborate a lot with industry and academia; for this work in particular Rackspace was involved, and we got some support from Intel to make sure that their drives were configured properly.

Okay, so a bit of the motivation for the results I'm going to show you.
Here I have some results. The actual numbers are not too important, but we got a new cluster with NVMe devices that are supposed to be very fast, so we said, okay, let's run TeraSort, the default benchmark for Hadoop applications. This first bar, and lower is better here, this is running time in seconds, shows TeraSort sorting one terabyte of data in Hadoop using only the NVMe drives, and we got this number. Then we tried a combination: we also had a JBOD in this cluster, which I'll present later, so we used the JBOD, ten disks, plus the NVMe drive, and we got a number very similar to the first one. Of course, we have more disks here, so that can happen. But then we said, okay, let's only use the JBOD, and we got the third bar. And then, let's only use five disks of the JBOD, and we got less than a 10% performance difference compared to using only the NVMe devices. So we were asking, what's going on here? We've been running this code for years; we know Hadoop is set up correctly and the drives are set up correctly. Why aren't we getting a good performance increase from these drives? When we switched from rotational drives to SSDs some years back, we saw up to a 3x performance increase.

So initially we wanted to see the benefits of using NVMe drives. Our motivations are to explore use cases in big data where this type of drive makes sense. We got these poor initial results, so we contacted Intel, and they supplied us with an HBase use case that we tried to replicate. We also wanted to measure the potential of these drives, and to extend our platform to benchmark the devices themselves, not only the big data clusters. And we found out that it is a challenge to produce a big-data, application-level benchmark that actually stresses this hardware completely. The reason for the marginal gains here is that this is very high-end hardware, and maybe the workload is too small, or the JBOD is simply very fast. So let's look at that in detail.

The cluster specification: we have five nodes, one master and four worker nodes, running CentOS 7 with 128 GB of RAM per node. The master has a RAID 10 array for the data, with the OS on a separate disk, and there's a 10-gigabit network. On the worker nodes we have the NVMe drives with 1.6 terabytes of storage, and a SAS JBOD with ten 15K RPM Seagate disks. You will find the references later in the slides if you want to see them. The NVMe drive we're testing is the Intel P3608. It's not the newest from this year; it's from a year or a year and a half back. It promises 5 gigabytes per second of read throughput, both random and sequential, and a write bandwidth of 2 gigabytes per second. The price is around 10K US dollars, from a search I did last week while preparing the slides. We also have a second NVMe device from an older generation, an LSI Nytro WarpDrive from 2012. The price for this one is now around 4K per unit, but when it was released it was around 12K. We're using this disk just for verification and validation; it's in another cluster. What's different between the drives is that the first one is PCIe 3.0 with eight lanes and the other one is PCIe 2.0, so an older generation.

Okay, let's start with the FIO benchmarks. FIO is the Flexible I/O tester.
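To make the device tests concrete, here is a minimal sketch of the kind of fio job file that drives a test like this; the 64 KB block size and I/O depth of one match the latency test mentioned later, while the target path, file size, runtime, and job mix are illustrative rather than our exact configuration:

```ini
; Sketch of a fio job: direct I/O, 64 KB requests, I/O depth 1,
; one random-read job and one sequential-write job run back to back.
[global]
ioengine=libaio
direct=1
bs=64k
iodepth=1
size=10g
runtime=60
time_based
group_reporting
filename=/mnt/nvme/fio.testfile

[rand-read]
rw=randread

[seq-write]
stonewall
rw=write
```

Sweeping the block size, queue depth, and number of jobs is how you find the best-case numbers that vendor specs refer to.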
In my opinion, it's one of the best benchmarks for measuring disk performance; we can talk more about that later. What we want to do with FIO is, first, verify the vendor specs. You buy a disk, or you're going to buy a disk, and you get these numbers from the vendor about what the disk is supposed to do; we want to verify whether we can really achieve those numbers or whether they only hold under ideal conditions. The second part is making sure we have the hardware set up correctly. For example, on this Intel drive we had to update the firmware to actually get it performing at full capacity. So this is important, and we also want to set performance expectations.

So let's look at some results for maximum bandwidth, in megabytes per second, that we can achieve in the cluster. Let me guide you through the results; all of the data will be on the slides, so I'm just going to highlight certain things. Each color here is a different benchmark: random read, then random write in orange, sequential read in gray, and sequential write in yellow. Higher is better here; this is throughput in megabytes per second. Both of these Intel drives get close to 5 gigabytes per second of read bandwidth. The random read is a bit lower; the spec quotes the same figure, but we got numbers very close to the spec. For writes, we get the 2 gigabytes per second promised in the spec, with a particular FIO configuration. I'm showing the configuration that gave the best results, just to compare the disks. Second, we have the SAS JBOD with the 15K RPM rotational drives, and here are the ten disks combined. One thing we can see is that the random reads and writes are much lower than on the NVMe devices, but the ten disks together reach almost 2 gigabytes per second of throughput. This bar is the result with only one disk, which gets the numbers you would expect from a single SAS drive, below 200 megabytes per second for reads and writes. So sequential performance is quite good on the JBOD. And here's our other PCIe drive, which actually comes as two disks. This is the result we get with only one of them, and we pretty much double the results when using the two disks together. The maximum bandwidth on this one is around 4 gigabytes per second, and the random writes are a bit below 2 gigabytes per second, not so different from the newer-generation Intel drives we got.

About latency, we also ran some tests to see the latency of these devices. All of the NVMe devices have very good latency; here lower is better, and it's below 400 microseconds for both reads and writes. On the JBOD, as expected for SAS drives, we get higher latency for random reads and writes, but for sequential access the latency is quite low, even below 300 microseconds in some cases. This is with a particular FIO test: 64 KB request size and an I/O depth of one. FIO has a lot of configuration options; at the end of the slides I give a summary of them, and there are some notes on the slides for later.

So let's get into the application: HBase. HBase is the big NoSQL database built on top of Hadoop. You can actually put it on any file system, but usually people put it on top of HDFS for storage and safekeeping. It has good properties: close to real-time, low-latency random access.
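Since everything in the HBase benchmarks below boils down to puts and gets issued through the client API, here is a minimal sketch of what that looks like in Java; the table name, column family, and values are placeholders (usertable happens to be YCSB's default table name), so treat this as an illustration rather than the exact benchmark code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"))) {

            // Write path: the put is appended to the write-ahead log on disk
            // and kept in the in-memory MemStore.
            Put put = new Put(Bytes.toBytes("user1000"));
            put.addColumn(Bytes.toBytes("family"), Bytes.toBytes("field0"),
                          Bytes.toBytes("some value"));
            table.put(put);

            // Read path: served from the MemStore, the L1 block cache, or the
            // L2 bucket cache, falling back to the HFiles stored on HDFS.
            Get get = new Get(Bytes.toBytes("user1000"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("family"), Bytes.toBytes("field0"))));
        }
    }
}
```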
That kind of low latency is surprising for a project from the Hadoop ecosystem, and HBase is used a lot in production. Usually HBase is sort of a building block for other big data projects; there are other projects that expose a SQL interface but actually store their data through HBase.

So let's see how the HBase model works. Every time you do a write, it is a put into the system, putting a value into the system. This data is written directly to disk into the write-ahead log for safekeeping, but it's also placed on the Java heap in a MemStore, so it can be served directly from RAM very fast; this is what gives HBase its low latency. Every once in a while the buffers are flushed to disk on HDFS, into the HFiles. If data is not in the MemStore, it is read from HDFS and goes into the block cache. A block is the unit of storage in HBase, and there are blocks of different types. The block cache is what's called the level-one cache, with an LRU, least recently used, algorithm to evict blocks. In the latest versions of HBase they have added something called the bucket cache. You can think of the bucket cache as a level-two cache that can be off-heap, so you can have HBase controlling a larger piece of memory for its cache. It's fixed in size; when you set it up, you say what size you want. You can set it up on-heap, actually on your Java heap space, but we don't recommend that option; we only found marginal improvements, and you would be competing with the block cache anyway. You can set it up off-heap, outside the Java heap space, through Java NIO. And the interesting part is that you can also point it at any file on the file system; in this case, we put it on the NVMe drive and also on a RAM disk for testing. Here you have the schematic: here's the Java heap space, and then off-heap you have the level-two cache for HBase.

So we performed several experiments; let me summarize them. First we have a baseline, HBase 1.2.4 without any special tuning and without the bucket cache. The second puts the bucket cache off-heap, managed by Java. The third puts the bucket cache on a RAM disk, and the fourth puts the bucket cache in a file hosted on the NVMe drive. The difference is that, of course, we cannot allocate all of our RAM for caching, so we use 32 GB of RAM for the RAM-based configurations and 250 GB per worker node on the NVMe devices. So NVMe has a larger cache size, as expected, since we have a larger drive than we have memory.

Most of the experiments I'm going to show are read-only. The first group is a use case where we are only reading data, and then there is a mixed type of workload. The benchmark is YCSB; I will get into it a bit later. The payload is around 250 million generated records, which amount to two terabytes of raw HDFS storage; we're using a replication factor of one here, so this is the actual size stored on disk. So let's get into the read-only benchmarks, the GETs, doing GETs from HBase; we're using 500 threads to read the data in the benchmark.
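For reference, the bucket cache configurations just described come down to a couple of region-server properties in hbase-site.xml. This is a rough sketch with an illustrative mount point and size, not our exact values:

```xml
<!-- Sketch: file-backed L2 bucket cache placed on the NVMe mount. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>file:/mnt/nvme/hbase-bucketcache.data</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>250000</value> <!-- interpreted as megabytes when the value is >= 1.0 -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value> <!-- the L1 LRU block cache, as a fraction of the heap -->
</property>
```

For the off-heap variant, the ioengine is set to offheap instead of a file path, and the JVM's direct memory limit has to be raised to match the cache size.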
Okay, let me guide you through the results. Each group of bars is a different disk configuration, out of the four we have, and each color is a different sequential run, so this is the first run for the baseline, then the second and the third. One thing we can see is that the first run is always a bit slower than the rest of the runs; that means the cache was cold, there was nothing in the cache yet. But the second and third runs have the same times. Higher is better here; this is throughput in operations per second. The yellow line you see is latency, and for latency lower is better. We can see that the off-heap bucket cache and the RAM disk have very similar results: whether you let Java manage it or you just mount a tmpfs file system, you get very similar results. The bucket cache can speed the results up a bit more; in the end, the bucket cache gets 2x the performance of the baseline and 50% more than the RAM disk strategies.

So let's look at how this looks at the server level, on the performance metrics. This is the average CPU for the baseline execution; each vertical bar marks a different run, so this is the first one, the second, and the third. You can see it's pretty stable, except at the beginning here: this red part is I/O wait. The graph at the bottom measures the read throughput from disk, and we see a huge orange spike that reaches 2 gigabytes per second. What's basically happening is that at the beginning of the run, files are being read from disk, and we see the I/O wait; but then memory fills up, either the Java heap space, this blue area, or the OS buffer cache. So it only reads from disk during the first seconds of the execution, and then everything is cached either on the heap or by the OS on our behalf. Network is quite constant; the network is not a bottleneck for any of these tests.

So we realized that we're not actually measuring the disk performance very much; everything is being cached in RAM. There are a couple of strategies you can follow, but first let's look at the other examples. On the off-heap configuration there are writes and reads throughout the execution; this is for the tmpfs RAM disk, and this one is for the bucket cache. In the bucket cache example, the one that is twice as fast, you can see the cache actually being filled in as we execute the first benchmark. One limitation is that with these metrics tools we cannot see how memory is being written or accessed; that's one of the improvements we need to make to our tooling. But what is happening is that the OS buffer cache is really effective on these clusters, and we're not really stressing the hardware. The challenge is how to benchmark the drives outside of the buffer cache. One strategy would be to build a very large workload, and we did some tests generating terabytes of data, but they take days to run; we cannot spend days just waiting to see whether something finished and then more days processing all of the data generated. So the approaches we took were, first, limiting the available RAM in the nodes, lowering the RAM available to the OS to 32 GB instead of 128, and second, dropping the OS buffer cache every 10 seconds.

So what happens if we limit the memory, simulating a cluster with fewer resources? The baseline and the RAM strategies have pretty similar times, while with the bucket cache, where we have this external cache that is not in RAM, we get up to an 8x performance improvement when the nodes have less capacity. If we look at the CPU charts for the four different strategies, we see that the red here is I/O wait, so the bottleneck is reading from the disks.
But in the bucket cache example, since reads are not as slow as reading from the JBOD and the data is already in the cache, it gets to be 8x faster. The disk read throughput is around 2.5 GB per second, stable, for three of the strategies, except for the NVMe, where we get 38 GB per second of throughput for the whole cluster. This is aggregated across the cluster, so the read throughput is actually quite high. In the second experiment, where we drop the buffer cache, we do get some improvement from the off-heap and RAM disk strategies, so we're able to measure this, but the improvement is marginal, less than 1x, around 50%, and in this case the bucket cache shows a performance improvement of 9x. Of course, the total running times are longer than in the first experiments where we had all of the RAM, but this has been a useful way to see what the capacity is and what we can expect from the drives.

Okay, the next set of experiments is related to running the whole benchmark suite. YCSB has several workloads. We were using Workload C, which is read-only, in the first example, but there are also read-update, mostly-read, update-heavy, and read-modify-write workloads. Since these workloads add new data, we cannot run them repeatedly and sequentially as we did before, because we would actually be increasing the size of the storage, so the recommendation is to run them in order from workload A to F. We're skipping workload E because it's quite slow, as it does scans, and we wouldn't have finished in time if we had run it. So, very quickly on these results: when running the mixed workloads we have three strategies, baseline, RAM, and bucket cache, and each line is a different benchmark, so let me just quickly summarize the speedups. For the data generation there is no speedup from using a cache, since it's write-only. In some benchmarks we get more speedup, but the most we get is less than 1x, around a 50% improvement, by using the NVMe drives in this scenario, and similarly with the RAM strategies. So we're not even getting a 100% improvement when running a more production-style workload where we're also adding data and doing updates. In this set of experiments we also limited the RAM to 32 GB per node, and the speedup was higher, as expected: around 87% for the NVMe case, averaged over all of the different workloads.

So how does this look on CPU? This is the baseline, with all of the different workloads run sequentially, and this one is the NVMe case. I/O wait is the main bottleneck. If we look at the disk charts, we see what's going on: data is being generated, that's the blue part, and then all the orange area is being read from the JBOD, or pretty much from the JBOD, and the rest is served from the OS buffer cache. In the NVMe case we have more writes during the execution, and here are the updates in some of the workloads, but basically we get a more constant read throughput, with higher peaks of around 25 GB per second. So the drives are actually being used, and used quite effectively, but the OS buffer cache is really helpful for HBase and HDFS in general. Here are some charts with the network and memory, just to have them on the slides if someone wants to check later; the network is not the bottleneck, and it's actually heavily used when generating the data. Okay, yes.
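For reference, each YCSB workload is defined by a small properties file; the core of workload A, the read-update mix, looks roughly like this, with record and operation counts, the target table, and the thread count passed on the command line per run:

```properties
# Core of YCSB's workload A: a 50/50 read-update mix with a Zipfian request distribution.
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```

The Zipfian request distribution is what makes caching pay off in these workloads: a small set of hot rows receives most of the requests.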
Okay, let's get to the summary and conclusions of this work. With the bucket cache we've been able to increase the performance of HBase, especially when putting the cache file on an NVMe drive. However, it was surprising that the speedup was only around 2x over the baseline in the scenario where the cache is already warm and we're doing a read-only type of workload. For a more general workload we get a 50% to 100% improvement, which is not that much. Of course, if you can double the capacity of your cluster just by adding some storage hardware, that might be interesting. But we see that the more you limit and stress the resources of your hardware, the higher the benefits. We wouldn't recommend running the bucket cache on-heap; we would recommend always running it off-heap, either in memory or in a file on a separate mount point. That could also be a regular SAS or SATA SSD; it doesn't have to be NVMe to get some benefit.

Some of the things we have learned: testing high-end hardware at the application level is not so easy. You need to generate very big workloads, or artificially limit the resources, to find out how much benefit you can get. And you still need to do per-device, per-node, and per-cluster system-level or micro-level benchmarks. The OS buffer cache is quite effective, so if you add more RAM, you're actually speeding up your applications. An L2 cache with LRU eviction pays off when some items are more popular than others; YCSB uses a long-tail, Zipfian type of distribution, so the items are not uniformly accessed, and you can get more speedup there. And the techniques we used, either dropping the cache or limiting the RAM, are effective for testing this kind of hardware.

To conclude, I would say that NVMe devices are fast and have very low latency, but they are more expensive than putting in a JBOD. Big data applications are designed to work with sequential reads and writes, so until we actually write applications differently and update the code to do random reads, or even byte-addressable access, treating the NVMe device as if it were RAM rather than a block device like a disk, we won't get the full benefit of these drives beyond use cases like caching. Right now, in the big data ecosystem, you need to rely on external tools or research projects to actually speed things up and tier the storage, instead of having it included as a cache in the storage file system. So yes, this is the first part of our work, and we welcome any feedback you might have. That will be all; I will leave some references in the slides, and thank you for your attention. Okay, I'll take some questions. Yes.

The question was how we limited the RAM: we used the stress tool, which is basically a program that fills RAM up to the amount you tell it. We only used it to fill RAM, so no CPU load. Another option that was suggested is to set the maximum RAM in the boot kernel parameters. That would have been preferable, but we didn't have the cluster on site, we don't own the hardware, and I didn't want to be regenerating the initramfs.

NVMe over, like, RDMA? Okay, so the question is whether we looked into NVMe over Fabrics, using RDMA over InfiniBand or RoCE. No, but I have read and worked with some of the papers; look for Professor Panda's group at Ohio State University. They have experimental results where they do byte-addressable remote memory access to this kind of device.
They show interesting numbers, but the use case we're looking at is more for production, something we want to deploy soon, and for that you would need to modify the application and rely on code that is maybe not production-ready. That's the reason. The back first.

Well, there's less storage in RAM. We're using 32 GB of RAM for that, and it competes with the OS buffer cache, so you have less buffer cache and less memory for the Java heap space, while in the other case we have 1.6 terabytes of storage per node. That was enough for the experiment.

Yes, the question was about the file system: we use XFS on the NVMe drives. Actually, it's mounted as a software RAID, because the drive comes as two disks, so we're already losing some performance there on the software RAID. Yes, there are other, I would call them research or newer, projects that improve on the standard disk file systems. I think I named one here: there's one called SSDFS, which is a file system designed for flash drives. It also spares your drives so that the cells don't wear out so easily. That's interesting, but I would call it experimental for production in this case. Another question? Okay, we can keep talking later if you have more questions. So thanks all for your attention.