Welcome to Bringing Compression to Postgres at Zero Cost, where our speaker will present a solution that allows Postgres users to achieve significant data storage savings through compression at zero CPU or performance cost. My name is Lindsay Hooper, I'm one of the Postgres Conference organizers, and I'll be your moderator for this webinar. I'm here with Tong Zhang, co-founder and chief scientist at ScaleFlux. Tong focuses on commercializing computational solid-state storage drives, and he is currently a professor in the Electrical, Computer, and Systems Engineering department at Rensselaer Polytechnic Institute. Tong received his PhD degree in electrical and computer engineering from the University of Minnesota, Minneapolis in 2002. Welcome, Tong. That's all I've got, so with all of that, I'm going to hand it off to Tong. Take it away.

Thank you. Thank you, Lindsay. It's really my great honor to have this opportunity to share our experience developing a solution that allows Postgres to benefit from data compression without compromising its performance.

We know that Postgres on its own does not compress table data unless the record size is large, over about two kilobytes, when TOAST kicks in. As a result, end users have to rely on the underlying storage hierarchy to bring data compression into the picture.

The first option is to run Postgres on file systems that support transparent compression, such as ZFS and Btrfs. Internally, these file systems carry out block compression and store each compressed block over one or multiple 4 KB sectors on the storage device. This 4 KB alignment constraint causes pretty high storage space waste and degrades the overall compression ratio. To improve the compression ratio, we could increase the compression block size, for example from 8 kilobytes to 32 kilobytes per compression block. But given the 8 KB database page size, a larger compression block size will clearly result in higher write amplification at the file system level, which further leads to larger database performance degradation. Another option to improve the compression ratio is to make those file systems apply more powerful compression algorithms; for example, instead of CPU-light LZ4 compression, we could apply Zstandard or Zlib compression to improve the compression ratio, at the cost of CPU overhead. This will also lead to larger database performance degradation. Moreover, in addition to this fundamental compression ratio versus performance trade-off, ZFS and Btrfs are far less popular than journaling file systems like ext4 and XFS; however, those journaling file systems on their own do not support transparent compression at all.

To support data compression, another option is to use a block layer with built-in transparent compression, such as the VDO module in the latest Linux kernel. Operating underneath the file system, such a block layer module transparently compresses each 4 KB block and packs multiple compressed blocks into one 4 KB sector. Compared with file-system-level compression, the compression ratio of such a block layer module unfortunately tends to be lower, or even much lower.
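To make that alignment math concrete, here is a minimal back-of-the-envelope Python sketch of the three placement schemes just discussed. The 2.7:1 raw compression ratio is an assumed, illustrative number, not a measurement from the talk; real compression at 4 KB granularity is typically worse than at 8 KB, which penalizes the block-layer approach even further.

```python
# Back-of-the-envelope comparison of effective (stored) compression ratio
# for the same raw compression ratio. Numbers are illustrative only.
import math

SECTOR = 4096  # 4 KB sector size on the storage device

def fs_effective_ratio(block_size: int, raw_ratio: float) -> float:
    """File-system level (ZFS/Btrfs style): compress each block, then round
    the compressed output up to whole 4 KB sectors."""
    stored = math.ceil((block_size / raw_ratio) / SECTOR) * SECTOR
    return block_size / stored

def block_layer_effective_ratio(raw_ratio: float) -> float:
    """Block layer (VDO style): compress each 4 KB block independently and
    pack whole compressed blocks into one 4 KB sector; a compressed block is
    not split across sectors, so leftover space in each sector is wasted.
    (Real 4 KB-granularity raw ratios are also lower than 8 KB ones.)"""
    compressed = int(SECTOR / raw_ratio)
    blocks_per_sector = max(1, SECTOR // compressed)
    return float(blocks_per_sector)  # logical 4 KB blocks stored per sector

def byte_granular_ratio(raw_ratio: float) -> float:
    """Computational storage drive: compressed blocks are placed back to
    back in flash with no alignment waste, so stored size == compressed."""
    return raw_ratio

if __name__ == "__main__":
    raw = 2.7  # assume data compresses 2.7:1 at 8 KB granularity (made up)
    print("FS, 8 KB blocks :", fs_effective_ratio(8192, raw))     # -> 2.0
    print("FS, 32 KB blocks:", fs_effective_ratio(32768, raw))    # -> 2.67
    print("Block layer     :", block_layer_effective_ratio(raw))  # -> 2.0
    print("Byte-granular   :", byte_granular_ratio(raw))          # -> 2.7
```

The sketch shows the pattern the talk describes: enlarging the file-system block size recovers compression ratio but amplifies writes of 8 KB pages, while byte-granular placement keeps the full raw ratio with no such trade-off.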
So to improve the compression ratio at the block layer, the only option is to apply a more powerful compression algorithm, which of course will lead to larger performance degradation. So regardless of using file-system-level or block-layer transparent compression, the system is always subject to a strict trade-off between storage cost savings and database performance degradation. Moreover, due to inherent implementation constraints, neither file-system-level compression nor block-layer compression can achieve a high compression ratio. As a result, today most end users typically just forget about data compression, run Postgres on a normal storage hierarchy, and leave highly compressible data completely uncompressed on the storage device.

Up to now, it seems like there's nothing we can do here, right? We have no other options. Actually, this is where the hardware people can come in to help. That is what led us to make each storage device capable of carrying out hardware-based compression that is completely transparent to the software stack. This brings data compression back into the picture, and meanwhile without suffering from the storage cost versus database performance trade-off. We call such a storage device a computational storage drive with data path transparent compression. In the remainder of this talk, I will discuss the very basics of such storage hardware and its application to Postgres.

As we know, nowadays CPU technology scaling is reaching its limit, and the computing infrastructure is transitioning from traditional CPU-only homogeneous computing towards domain-specific heterogeneous computing, where certain computation tasks migrate from the CPU to other computing engines, including standalone PCIe accelerators, smart network cards that offload network processing, and computational storage drives that offload tasks such as data compression and encryption from the CPU. Altogether, they complement the CPU and form a truly heterogeneous computing infrastructure.

The basic concept behind computational storage is very straightforward, very simple: we just make each storage drive capable of performing certain heavy-duty computation on the IO data path. This forms a highly distributed heterogeneous computing infrastructure that can maximize computation parallelism with minimal extra data movement and latency overhead. Our company, ScaleFlux, has been leading the wave of computational storage, and we were the first to launch computational storage products into the commercial market. Our current product, a special class of computational storage drive with built-in data path transparent compression, became GA early this year and is being deployed worldwide, with applications mainly in the database domain. Recognizing this trend of computational storage, the storage industry association SNIA has already commissioned a working group in charge of standardization, where ScaleFlux is a founding member; we have received a great response from industry, and the membership of this working group just keeps growing.

So in the heterogeneous computing landscape we just talked about, regarding the commercialization of computational storage, data path transparent compression is apparently the first low-hanging fruit to pick, and its basic idea is very simple.
The compression is done in hardware on the IO path, completely transparent to the OS and the user application. Our current product uses a single FPGA to handle both flash control and per-4-KB Zlib compression. We implement the flash translation layer in a kernel-space driver that exposes the storage device as a standard block device to the Linux block layer, so it's very easy to use.

This cartoon further illustrates its difference from the current practice. On the left-hand side is the current practice, where we use either the CPU or an accelerator to handle data compression and deploy a normal SSD. The right-hand side shows the computational storage drive with data path transparent compression. Here, a single FPGA combines the functionality of the flash controller and hardware compression and decompression, providing a plug-and-play solution to the users: the drive frees up CPU cycles, minimizes data movement, and enables the compression throughput to scale with the storage capacity.

From the functionality perspective, the computational storage drive with data path compression is logically equivalent to the block-layer transparent compression we just discussed: both compress each 4 KB of user data completely transparently to the file system and the user application. However, from the implementation perspective, the computational storage drive integrates hardware-based Zlib compression, which can achieve a much higher compression ratio at zero CPU cost. Moreover, it can tightly place all the compressed blocks in flash memory without any storage space waste, which further boosts the overall data compression ratio.

This figure compares the compression ratio of our drive with several mainstream compression libraries: LZ4, Zstandard, and Zlib. Here we use the Canterbury corpus files as the benchmark, set the compression block size at 8 kilobytes for all the compression libraries, and align each compressed block to a 4 KB boundary. From the results, we can clearly see that the computational storage drive achieves the best compression ratio, even compared with very powerful compression libraries like Zstandard and Zlib.

This slide shows basic FIO testing of our drive and a competing high-end NVMe drive, both 3.2 terabytes. FIO generates a very heavy IO workload across the entire 3.2-terabyte storage space. The three figures show the random IOPS when each IO request is 4 KB, 8 KB, and 16 KB. In each figure, the horizontal axis is the read percentage in the IO workload. We can see that as the workload changes from read-intensive to write-intensive, the IOPS of the normal NVMe drive drop significantly, simply because of its internal garbage collection overhead. In comparison, on our drive, the built-in transparent compression reduces the write data traffic on the fly, leading to much less internal garbage collection activity inside the drive. As a result, really not surprisingly, our drive with built-in transparent compression achieves much higher random IOPS, more than two times higher in this figure.
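For anyone who wants to reproduce this kind of read/write-mix sweep, here is a minimal sketch driving fio from Python. The device path, runtime, and queue depths are assumptions rather than the talk's exact settings.

```python
# A minimal sketch of an FIO read/write-mix sweep like the one described
# above. Device path and job parameters are assumptions. Requires root;
# WARNING: writing to a raw block device destroys its data.
import subprocess

DEVICE = "/dev/nvme0n1"  # assumed test device (will be overwritten!)

def run_fio(read_pct: int, block_size: str) -> str:
    cmd = [
        "fio",
        "--name=mix%d" % read_pct,
        "--filename=" + DEVICE,
        "--direct=1",             # bypass the page cache
        "--ioengine=libaio",
        "--rw=randrw",            # mixed random read/write
        "--rwmixread=%d" % read_pct,
        "--bs=" + block_size,     # 4k, 8k, or 16k per the slides
        "--iodepth=32",
        "--numjobs=4",
        "--time_based", "--runtime=60",
        "--group_reporting",
        # For a compression-capable drive, data compressibility matters;
        # fio's --buffer_compress_percentage flag can control it.
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for bs in ("4k", "8k", "16k"):
    for read_pct in (100, 80, 60, 40, 20, 0):
        print(run_fio(read_pct, bs))
```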
So now let's see how Postgres performs when we apply this kind of storage drive with built-in data path transparent compression. Here we run five different sysbench workloads on both our drive and a competing high-end NVMe drive. We keep all the Postgres parameters at their default settings, and the dataset size is about two terabytes. Even though sysbench generates data randomly, with relatively low data compressibility, the results do show that our drive can transparently compress the two-terabyte dataset to less than 800 gigabytes, representing about a 60 percent storage cost reduction.

This figure shows the TPS of the five sysbench workloads. We can see that for write-intensive workloads, like update non-index and update index, both our drive and the competing normal NVMe drive have pretty much the same TPS performance. At first glance, this seems to contradict the FIO random IOPS comparison we just showed: our drive achieves much better random IOPS in FIO testing, but that does not translate into the Postgres TPS comparison. The main reason is that the dataset size and the write IO intensity here are not large enough to trigger garbage collection inside the normal NVMe drive. Therefore, neither drive experiences internal garbage collection, and as a result they tend to have similar performance under the write-intensive workloads. Meanwhile, we can see that the read-intensive workloads actually have noticeably better TPS performance on the drive with transparent compression. So where does the gain come from? The reason is that by compressing each 8-kilobyte page, the drive reduces the probability that different read requests access the same flash memory chip; that means it reduces flash memory die access conflicts. This leads to higher page read throughput, which gives us the higher TPS under the read-intensive workloads. From these results we can see that we not only reduce the storage cost transparently, but at the same time can even improve performance. That is the really straightforward usage.

Beyond this use case, we can go one step further to make Postgres take even more advantage of a storage drive with built-in transparent compression. First, we all know that Postgres uses a parameter called fillfactor to control the amount of space reserved in each page for future updates. Clearly, the value of the fillfactor directly determines the trade-off between database performance and storage cost. If we reduce the fillfactor to leave more space for future updates in each database page, the database performance will improve, especially under update-intensive workloads, but meanwhile the storage cost will accordingly increase. As a result, Postgres by default sets the fillfactor at 100, meaning it does not reserve any space within each page for future updates, in order to minimize the storage cost. Interestingly, once Postgres runs on a storage drive with transparent compression, the storage overhead caused by the reserved space largely disappears, because Postgres initializes the reserved space as all zeros, and the transparent compression can highly compress those all-zero segments. This naturally enables Postgres to set the fillfactor more aggressively without sacrificing physical storage cost.
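As a concrete sketch of that tuning knob, the snippet below sets a more aggressive fillfactor on one of the sysbench tables; the connection string is a placeholder, and psycopg2 is just one common client library.

```python
# A minimal sketch of the fillfactor tuning described above, using the
# psycopg2 client library. The DSN and table name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=bench user=postgres")  # assumed DSN
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Reserve 25% of each heap page for future (HOT) updates. On a drive with
# transparent compression, the reserved space is written as zeros, so it
# compresses away and costs almost no physical flash capacity.
cur.execute("ALTER TABLE sbtest1 SET (fillfactor = 75);")

# ALTER TABLE ... SET only affects newly written pages; rewrite the table
# so the new setting applies to existing pages as well.
cur.execute("VACUUM FULL sbtest1;")
```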
So to better illustrate this, let's look at this cartoon. The blue and black dots represent the operating points when using our drive and the normal NVMe drive with the default fillfactor of 100: they have pretty much the same performance, and our drive transparently reduces the physical storage cost by half through the transparent compression. Once we reduce the fillfactor, say to 50, then on the normal NVMe drive the storage cost almost doubles while the performance improves; it's a very clear trade-off between performance and storage cost. Meanwhile, on our drive with transparent compression, we can expect the same performance improvement, but the storage cost remains almost unchanged.

To better demonstrate this, we further carried out sysbench TPC-C benchmarking. Here we consider two different datasets: on the left-hand side a 740-gigabyte dataset, and on the right-hand side a 1.4-terabyte dataset. The results show that as we reduce the fillfactor from 100 to 75, the TPS improves by about 33 percent. On the normal NVMe drive, the physical storage space jumps from 740 gigabytes to 900 gigabytes, or from 1.4 terabytes to 1.7 terabytes. In comparison, on our drive with transparent compression, we see the same 33 percent TPS improvement, while the physical storage space only very slightly increases, from 178 gigabytes to 189 gigabytes, or from 342 gigabytes to 365 gigabytes. This study very clearly shows that by configuring the fillfactor parameter, Postgres can very nicely take much better advantage of the underlying transparent compression to further improve TPS performance without compromising the storage cost savings.

To materialize the storage cost savings for users, our drive can expose a logical storage capacity that is much larger than the physical storage capacity, for example 2x, 4x, or even more. Of course, due to long-term variation in the data compression ratio, users must be able to monitor the physical storage space usage and manage their data storage accordingly. In this context we provide two levels of support. First, we provide ioctl and sysfs APIs for users to query the physical storage space usage at runtime; this can be easily integrated into existing storage space management environments. To make things even simpler, we also provide a space balancer tool that runs as a background daemon to ensure the file system will never run out of physical storage space before using up the total logical storage space.
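As an illustration of how such runtime monitoring might plug into a management script, here is a sketch that polls sysfs attributes. The attribute paths, their format, and the device name are hypothetical placeholders, since the talk does not spell out the actual ScaleFlux API.

```python
# A sketch of runtime physical-space monitoring. The sysfs attribute paths
# and their contents are HYPOTHETICAL placeholders; consult the vendor
# documentation for the actual ScaleFlux ioctl/sysfs interface.
import time

PHYS_USED = "/sys/block/sfdv0n1/scaleflux/physical_used_bytes"      # hypothetical
PHYS_CAP  = "/sys/block/sfdv0n1/scaleflux/physical_capacity_bytes"  # hypothetical
ALARM_THRESHOLD = 0.90  # alert when physical space is 90% full (policy choice)

def read_u64(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

while True:
    used, cap = read_u64(PHYS_USED), read_u64(PHYS_CAP)
    if used / cap > ALARM_THRESHOLD:
        # In a real deployment: page an operator, pause ingest, or rely on
        # the vendor's space-balancer daemon to reclaim space first.
        print(f"WARNING: physical space {used / cap:.0%} full")
    time.sleep(60)
```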
Up to now we have mainly been focusing on the ScaleFlux computational storage drive that performs hardware-based transparent compression, but actually we are not alone; we are not the only company in this area. Storage hardware with built-in transparent compression is now quickly becoming very pervasive. For example, almost all the all-flash array products from Dell, HPE, and Pure Storage natively support built-in hardware-based transparent compression; storage drives with built-in transparent compression are also being commercialized by Seagate, and many similar products are coming to the market very soon. Moreover, cloud vendors like Amazon and Microsoft have also started to deploy hardware compression capability in their cloud environments, so cloud-native transparent compression will be available in the very near future. Therefore, now may be the right time for the database community to study how relational databases could take full advantage of such new storage hardware. In the following, I will present two very simple ideas along this direction, and we would love to explore deeper engagement with the Postgres community.

The first idea is to apply a dual in-memory versus on-storage page format in Postgres, which can further improve the data compression ratio on storage hardware with built-in transparent compression. Again, the idea is very simple, motivated by column stores: when a database page stays in the database cache memory, we simply keep the conventional row-based format, and when flushing a page from the database cache memory to storage, the database itself converts it on the fly into a column-based format and applies some CPU-light transformation, like XOR or byte shuffle, to each column to improve the data compressibility. To demonstrate this simple concept, we have already used InnoDB as a test vehicle, and the results show that we can further boost the data compression ratio by another 40 percent with very minimal performance impact. It would be very interesting to see how this idea could be integrated into the Postgres environment.

The second idea is about reducing the write IO traffic caused by full-page writes in the write-ahead log (WAL) in Postgres. We know that Postgres applies full-page writes to enhance the database storage reliability, at the cost of higher write IO traffic; for write-intensive workloads this can lead to very noticeable performance degradation. The idea here is that by leveraging the transparent compression, we simply pad zeros into the WAL so that each 8 KB full-page image is always aligned to a 4 KB boundary, without sacrificing storage cost, because the drive can highly compress all those zeros. Meanwhile, we can leverage the file range clone feature in file systems to clone the 8 KB full page from the table space into the WAL: because the 8 KB page in the WAL now exactly covers two 4 KB sectors, we can borrow this file-system-level clone feature. Although file systems like ZFS and Btrfs have supported file range cloning from the very beginning, the journaling file system XFS only started to support file range cloning very recently, which enables a very nice opportunity. This makes it possible to realize the full-page write simply through a file range clone instead of physically writing the whole 8 KB of data into the WAL. As a result, we can completely eliminate the write IO traffic caused by full-page writes in the WAL.
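To make the padding-plus-clone idea concrete, here is a user-space sketch of the two building blocks: zero-padding a log file to the next 4 KB boundary, and cloning an aligned 8 KB page between files with the Linux FICLONERANGE ioctl. The file names are placeholders, and a real implementation would live inside Postgres's WAL code rather than a script.

```python
# A user-space sketch of the two building blocks behind the second idea:
# (1) zero-pad a log file so the next record starts on a 4 KB boundary
#     (the zeros compress away on a transparent-compression drive), and
# (2) clone an aligned 8 KB page between files with FICLONERANGE instead
#     of physically rewriting it. Both files must live on the same
#     reflink-capable file system (e.g. XFS with reflink=1, or Btrfs).
import fcntl
import os
import struct

SECTOR, PAGE = 4096, 8192
# From <linux/fs.h>: #define FICLONERANGE _IOW(0x94, 13, struct file_clone_range)
FICLONERANGE = 0x4020940D

def pad_to_sector(fd: int) -> None:
    """Append zeros so the file size becomes a multiple of 4 KB."""
    size = os.fstat(fd).st_size
    if size % SECTOR:
        os.pwrite(fd, b"\x00" * (SECTOR - size % SECTOR), size)

def clone_page(src_fd: int, src_off: int, dst_fd: int, dst_off: int) -> None:
    """Reflink one 8 KB page (two 4 KB sectors) instead of copying it."""
    # struct file_clone_range { s64 src_fd; u64 src_offset, src_length, dest_offset; }
    arg = struct.pack("qQQQ", src_fd, src_off, PAGE, dst_off)
    fcntl.ioctl(dst_fd, FICLONERANGE, arg)

# Usage sketch: pad the "WAL" file, then clone page 0 of the table file
# into it as the full-page image, with no physical data movement.
table = os.open("table_file", os.O_RDONLY)            # placeholder name
wal = os.open("wal_file", os.O_RDWR | os.O_CREAT, 0o644)  # placeholder name
pad_to_sector(wal)
clone_page(table, 0, wal, os.fstat(wal).st_size)
```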
Eliminating that write traffic not only helps increase flash memory endurance, but can also improve database performance dramatically, especially under write-intensive workloads. So that is another very simple idea: we can very nicely leverage transparent compression and file-system-level file cloning to keep the full-page-write feature without triggering the physical write traffic.

OK, so in conclusion, the emerging storage hardware with built-in transparent compression is indeed a perfect match for Postgres. Without changing a single line of code, Postgres can very nicely benefit from such storage hardware in terms of both storage cost and performance. Moreover, if we are willing to slightly modify the source code, there could be a much larger spectrum for Postgres to take even better advantage of such storage hardware, and we just presented two simple ideas as examples. Again, at ScaleFlux we really sincerely look forward to working with the Postgres community to explore how Postgres could take full advantage of such new storage hardware, so I would say there are future opportunities. With that, this ends my talk. Thank you very much. I would love to answer any questions.

OK, the first question has come in: have you done any tests on distributed databases? No, not yet. Our testing was done on standalone, single-server setups, like MySQL, Postgres, you know, Oracle. But I would guess that regardless of whether it's distributed or not, as long as the users have the demand to save storage cost without impacting performance, our drive should be able to help. But still, we haven't done any concrete testing on a true distributed database.

Thank you. Another question: are the two ideas for changes to Postgres complementary to the compression? Yes. Those two ideas we just presented are purely in the context of Postgres; the objective is to take better advantage of the underlying compression capability built into the storage hardware. They just try to take advantage of the existing transparent compression capability.

Great. What is needed to engage for a POC? Oh yeah, actually we're very open to any potential POC; just shoot us an email and we can arrange it. Either we can set up an environment in our server room and the users can remotely run their testing on our drive, or we can provide a drive to the interested parties and they can run it in their own environment, either way. Again, our product is already GA, and we are running very active POCs, over 60 POCs all over the world, mainly in the database domain. We have received very positive feedback from many end users, from hyperscalers to enterprises: they see that it really saves cost and improves performance without any overhead on the CPU, so it's almost like a free lunch. So yeah, we really welcome any kind of discussion on POCs or collaboration.

Great. And have any third parties replicated your results? Yes. Well, actually we already have one independent consultant in China; he ran the testing in their own environment, and they actually showed even better performance
improvements than what we have seen internally, and they have already published a blog and put some materials online. We are also engaging with some other third parties; they are running tests and will publish their results in their blogs very soon.

Fantastic. OK, so with that, I want to thank you, Tong, for being here. I want to thank everyone for spending a little bit of their Wednesday with us, and I hope to see you all at the next Postgres Conference webinar. Have a great day. Yeah, thank you.