Hello everyone. Is it working yet? It isn't. OK, just for the recording: hello everyone, sorry for the delay. I am François Tigeot and I will talk about PostgreSQL and how we use pgbench to benchmark, check and improve the scalability of the DragonFly BSD operating system.

I'm an independent consultant and sysadmin, a former ccTLD engineer: I worked for the .fr registry. I have been a BSD and PostgreSQL user for a long time, and I introduced FreeBSD for the name servers of the .fr registry. I've also been a DragonFly developer for a few years now.

DragonFly BSD is a Unix-like operating system based on FreeBSD. It was forked in 2003 by Matthew Dillon from FreeBSD 4. There were many goals at the beginning; the most important ones are about performance and multi-processor scalability. For that, DragonFly uses per-core replicated resources and messaging. It is a bit different from most of the usual Unix-like operating systems: it doesn't use complicated locking algorithms, and by replicating resources per core, per logical core on hyper-threaded machines, many operations are naturally lockless.

DragonFly has a few innovative features which can be very useful for some workloads, and in particular database loads. For example, we have swapcache, which is a second-level file cache using the swap infrastructure and optimized for SSDs. This graph shows relative PostgreSQL performance with and without swapcache. The blue curve was recorded on a workstation with a limited amount of memory, by someone wanting to process maps for a good chunk of the North American continent. He used OpenStreetMap data, if I'm not mistaken, and his machine just couldn't keep up; it didn't have enough memory. By using swapcache, he was able to just buy a small SSD and process all the mapping data he wanted with much less severe performance degradation. Basically, he avoided buying a new workstation and only had to spend about $300 instead of $3,000.

I will now talk about the pgbench usage and what I did historically, in detail. The first pgbench runs were made in November 2011. At the time, the last stable release of DragonFly still had a big multi-processor lock, and we had just removed it in the development version. I had the occasion to use a dual-Xeon machine, which could run 24 hardware threads, and I was looking for benchmarks which could show the improvements in multi-processor scalability from the stable version to the development version. pgbench was a good fit. It is PostgreSQL specific and can be used in different ways. There is a read-only workload, which is very useful for showing multi-processor scalability, and this is the one I used (a small example of such a run is sketched below). It doesn't touch disks directly; it does use file systems, but all file system operations can be served from memory, so we don't have the problem of disk I/O bottlenecks polluting the benchmark results.

We had some problems immediately from the start. At high loads, the DragonFly kernel crashed. We had some weird bugs, namecache bugs for example, where some files were supposed to exist but couldn't be found by pgbench. These bugs were fixed quite quickly, generally in less than a day; it was mostly a matter of putting the right locking directives in the kernel. We had deadlocks in the VM subsystem, overflows in memory allocation subsystems, races. All of this was generally locking problems, exposed by the removal of the big multi-processor lock, and they were relatively obvious to fix.
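For readers who want to try this kind of run, here is a minimal sketch of the read-only pgbench workload described above. The database name, scale factor and client count are illustrative assumptions, not the exact parameters used in these benchmarks.

```python
# Minimal sketch of a read-only (select-only) pgbench run.
# Assumes PostgreSQL is running locally and pgbench/createdb are in PATH;
# the database name, scale factor and client count are illustrative only.
import subprocess

DB = "bench"          # hypothetical scratch database
SCALE = 100           # roughly 1.5 GB of data, small enough to stay in memory
CLIENTS = 24          # e.g. one client per hardware thread

subprocess.run(["createdb", DB], check=False)                        # ignore "already exists"
subprocess.run(["pgbench", "-i", "-s", str(SCALE), DB], check=True)  # initialize the tables

# -S selects the built-in read-only (SELECT-only) script: no writes,
# so the working set stays in the file cache and disks are not a bottleneck.
result = subprocess.run(
    ["pgbench", "-S", "-c", str(CLIENTS), "-j", str(CLIENTS), "-T", "60", DB],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # the "tps = ..." lines are the numbers plotted in the graphs
```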
For comparison purposes, I also used Scientific Linux, which is a Red Hat Enterprise Linux derivative; for server workloads, I thought it was a good fit. We have DragonFly at the bottom, the stable version, and Scientific Linux, which is obviously doing much better. We immediately found many bottlenecks, and changes were made in the days following the first results. We found we had another big multi-processor lock in the System V shared memory code. System V shared memory segments were used by Postgres in this version, so fixing that subsystem was very important. We had bottlenecks in select and poll, and many performance issues in the memory allocation paths. We had to fix the VM subsystem, the virtual memory subsystem, to improve the number of page faults we could process at one time. Postgres uses a big shared memory segment, so it exercises the memory allocation and virtual memory code paths, and we would keep finding problems in these subsystems for a long time.

After the first improvements, the new development version of DragonFly was sort of intermediate between Linux and the original stable version: it could more or less scale according to the number of physical cores, though it still had problems after some point.

A few years later, I had access to another dual-Xeon machine and decided to run new benchmarks, with a new Postgres version and new DragonFly versions. Basically, I ran a benchmark and found bottlenecks in the kernel; other people fixed them or made tweaks to the kernel. This was a constantly running process. We were communicating on IRC manually, so as not to lose too much time, but it still took a very long time to run all the benchmarks and do all the improvements.

At first we had DragonFly 3.0, which already included the improvements from the previous development version, but it was much, much worse than the Linux-based operating system used as a reference. One of the problems was that the new version of PostgreSQL used a different shared memory allocation technique: it didn't use System V shared memory segments anymore, it now used mmap. Nobody loves System V shared memory. Many operating systems still have defaults straight from the 1980s, where you cannot allocate more than a few megabytes of shared memory by default, so people have to tune their systems with sysctls, possibly recompile the kernel on some systems, and this is a big mess. That's why the Postgres people moved to mmap by default and away from System V shared memory (a small sketch of the mmap approach appears below).

After some time, we had still identified many bottlenecks and improved kernel performance, so DragonFly 3.2, which was the next stable version, was much improved and more or less on par with the reference Linux-based system.

As for the details of the improvements: we had one bright Summer of Code student add CPU topology awareness to the kernel, and the old BSD scheduler was changed to take this information into account. Knowing the machine topology is very important on new multi-processor systems. New MP machines are not symmetric anymore; they are NUMA, short for non-uniform memory access. Part of the memory is attached to one processor, part of the memory to another one. The different processors communicate with the I/O subsystems or with memory differently, and most importantly, each core in a processor socket has its own individual hardware resources. So we have to know when to migrate processes, or when not to migrate them, so as to keep caches hot.
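Coming back to the shared memory change mentioned above, here is a minimal sketch of anonymous, mmap-based shared memory inherited across fork(), which is the general technique PostgreSQL moved to. This is only an illustration of the idea on a Unix system, not PostgreSQL's actual allocation code.

```python
# Sketch of anonymous, shared mmap memory inherited across fork(),
# the general technique used instead of System V shared memory segments.
# Illustration only; assumes a Unix system.
import mmap
import os
import struct

SIZE = 1 << 20              # 1 MiB; SysV defaults are often capped far below what databases need
shm = mmap.mmap(-1, SIZE)   # anonymous mapping, MAP_SHARED by default on Unix

pid = os.fork()
if pid == 0:                # child: write a value into the shared region
    shm.seek(0)
    shm.write(struct.pack("Q", 42))
    os._exit(0)

os.waitpid(pid, 0)          # parent: sees the child's write, no shmget/shmat and no sysctl tuning needed
shm.seek(0)
(value,) = struct.unpack("Q", shm.read(8))
print("value written by child:", value)   # prints 42
```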
Keeping caches hot like this is very, very important for performance. And we found out that the original BSD scheduler, even when it had the topology information, still had significant bottlenecks. It was itself single-threaded and had to be rewritten. So Matthew Dillon wrote a new scheduler. It schedules processes as close as possible to the place where they last ran, to keep caches hot and also to keep translation lookaside buffers hot. TLBs are sort of caches for the virtual memory hardware, so when you use huge amounts of memory it's very important to keep these caches as hot as possible. On the other hand, we also try to avoid unnecessary competition for resources. When we have two processor sockets, a huge amount of cache per socket and only two Postgres clients, for example, it doesn't make sense to run the two Postgres clients on the same socket. It's best to run one on the first socket and the other one on the other socket, because you effectively double the amount of cache memory used. So there are many trade-offs we have to make. The same kind of problem exists between individual cores on each processor socket, and between hyper-threads on the same CPU core. We have to keep caches hot, but also globally balance the load and try to avoid hotspots and performance bottlenecks on a particular hardware resource.

There were also some tuning improvements. We found that many default values were tuned for 32-bit machines, in particular the amount of file cache memory we could use, and so on, so we changed that.

Virtual memory was also a big problem, because it has to use page tables, which are in-memory structures used to describe where a virtual address points to. By using, for example, 32 gigabytes of virtual memory, we may have to manage hundreds of megabytes, even gigabytes, of page tables just to describe where in physical memory these virtual pages are located (a rough back-of-the-envelope calculation is sketched below). So we had to do something, and that something was called the PMAP MMU optimization. Basically, the kernel tries to keep some of the page tables common between different processes. I'm not sure if I'm very clear on that part: we try to avoid duplication in the page tables, possibly avoiding the use of gigabytes, even tens of gigabytes, of memory when you have a huge PostgreSQL shared memory segment. And finally, we also found out we could read file data directly from the VM cache; that's the so-called read shortcut.

This graph shows a few of the improvements individually. This was the best DragonFly 3.0 performance; by sharing page table information we already improved performance tremendously. Reading file data directly from the virtual memory cache improved performance even more. And finally, the new scheduler is the green curve, and gives us something which was more or less equal to the Linux performance.

We also found out that these performance improvements were not PostgreSQL specific, and many workloads were improved. In general, the number of multi-processor virtual memory invalidations was much reduced, so the bigger the machine, the less it has to wait for other processors to process virtual memory mapping information. File operations were much improved with the read system call: generally, the more the machine was loaded, the less it had to wait. So we had improved performance under load.
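To give a rough sense of the page-table overhead mentioned above, here is a back-of-the-envelope calculation. All figures (4 KiB pages, 8-byte entries, a 32 GB shared segment, 100 backends) are illustrative assumptions, not measurements from the talk.

```python
# Back-of-the-envelope page-table overhead for a large shared memory segment.
# All figures are illustrative assumptions (x86-64: 4 KiB pages, 8-byte leaf PTEs).
PAGE_SIZE = 4096            # bytes per page
PTE_SIZE = 8                # bytes per leaf page-table entry
SHARED_SEGMENT = 32 << 30   # 32 GiB of PostgreSQL shared memory
BACKENDS = 100              # hypothetical number of PostgreSQL server processes

ptes = SHARED_SEGMENT // PAGE_SIZE     # leaf entries needed to map the segment
per_process = ptes * PTE_SIZE          # page-table bytes per process (leaf level only)
total = per_process * BACKENDS         # without sharing, every backend pays this again

print(f"leaf page-table memory per process: {per_process / 2**20:.0f} MiB")    # ~64 MiB
print(f"across {BACKENDS} backends without sharing: {total / 2**30:.1f} GiB")  # roughly 6 GiB
# Sharing those page tables between processes (the PMAP MMU optimization)
# brings the cost back toward the single-process figure.
```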
I recently had the occasion of running PostgreSQL benchmarks again, in March of this year. This time it was a 40 hardware thread machine, also dual-Xeon. We didn't really do any PostgreSQL-specific performance improvements this time; it was just to check if DragonFly was still performing adequately with recent PostgreSQL versions. We found that performance had still improved, but this time the gains were most likely caused by improvements we had made for running Poudriere, which is the FreeBSD-based package building system created by Baptiste, who is here. Poudriere is very process-, fork/exec- and I/O-intensive; it really exercises all kinds of kernel subsystems.

So we found out we were now better than Linux for much of the curve. DragonFly is in green; for reference, you have two Linux-based operating systems, Debian and CentOS. DragonFly is faster than Linux and scales better than Linux as long as you have available hardware resources. Performance degradation is a bit more severe than on Linux once you have more PostgreSQL clients than hardware threads, but this is known: we wanted to keep interactive performance, to not degrade it too much. This more severe degradation than Linux once hardware resources are exhausted is a consequence of wanting to avoid waiting 15 seconds after you have typed a key before seeing the result on the screen. It's a compromise. So, I'm done. Do you have any questions?

So, the swapcache, that is more of a file system cache, so that when you have a spinning disk, instead of reading data again from the spinning disk, you actually read it from the SSD, as opposed to just putting swap on it, correct? Is that more or less the idea?

Well, swapcache is a sort of second-level file cache and it uses the swap infrastructure; it really puts cached data in the swap area. It can be used with regular hard disks if you're really desperate; you can improve performance a bit, as if you had added more disks to a RAID pool. But it's really optimized for SSDs. It limits the amount it writes: it tries not to write too much data at the same time, so as not to wear out the SSD. And once the cache has been populated, it reads pages back directly from this cache instead. So yeah, the idea is really to use it more as a read cache and not wear out SSDs. You can have 100 gigabytes of second-level file cache that way.

So, kind of similar to the L2ARC on ZFS then? Well, I know the implementation is different because it's part of the operating system versus...

Yeah, well, ZFS does things differently. If I'm not mistaken, ZFS uses SSDs for write caching.

So, ZFS has both the L2ARC, which is effectively a read cache for the main dataset, but there's also the ZIL, the intent log, for writing, which is separate. So you actually have both, effectively a write cache and a read cache, through two different mechanisms.

Yeah, so this one is mostly a read cache. You can do almost everything with it. We have specific file attributes you can use: if you decide to put only a subdirectory hierarchy in the cache, you add this file attribute to the directory you want to cache. You can choose to cache only file metadata, or cache file contents; all kinds of variations are possible. This is controlled by sysctls and file attributes.

So, I'm going to get a bit sidetracked, obviously. You mentioned Poudriere workloads. Did you notice anything really interesting? I know that it's all over the place, but is there something which is more obvious to optimize looking at Poudriere workloads?
Well, the problem is Poudriere exercises all kinds of kernel subsystems at the same time, and we had really, really weird bugs. For example, we had a race condition in the TTY subsystem: just by doing an ls, or printing what was happening on the screen at the same time Poudriere ran, we had a kernel panic. And I think most kernels have these kinds of bugs, but only Poudriere was able to exercise so many subsystems at the same time so as to make them obvious. Most of the problems were in the virtual memory subsystem and the I/O subsystem, file reading mostly, yes, and directory access, the name cache. Multi-processing by itself was already a problem: with so many processes running at the same time, we had locking issues in some parts of the kernel, in heavily exercised system calls. For example read, I think, or select, or some common subsystems which had locking problems, but no program was able to make them obvious before Poudriere. It was everywhere. But the bulk of the bugs and issues were in the virtual memory and name cache subsystems, if I remember correctly.

You're welcome.

So I know that this kind of benchmarking is quite time-consuming and requires a lot of effort, so I have a two-part question. The first part is: I'm wondering if you intend to do another round of investigation of Postgres specifically at some point in the future. And secondly, have you given thought to methods of automating this sort of benchmarking, or some way that it could be run on an ongoing basis? Because I think one of the tricks is that if you're not directly following an upstream project like Postgres, as in this case when the mmap switch happened, it may not be immediately obvious that, oh, 9.2 to 9.3 may be an important point for a huge benchmarking run, right? So I'm not sure how to effectively find out when these sorts of transitions happen in upstream projects.

Yeah, that's the problem. Well, actually automating this benchmark and running it continuously is not difficult; I already automated it when I collected the data points. Each point was made from three different measures, and I had to run the benchmark many times, at least three times for each operating system and each point, so I created a shell script and ran it automatically for the entire X axis. So automating this kind of benchmark is not difficult. The problem is more about the availability of big machines. Every time I ran this benchmark, I had a dual-Xeon system available for at most a week before it was put into production, so it was really a matter of opportunity. We could run this kind of benchmark continuously, but we would have to buy a machine especially dedicated to it.

Okay, but the installation of the OS and everything was still relatively manual for your testing?

The installation of the OS is not a problem, because we can prepare a disk and just plug it into the machine when we want to run the benchmark, so this can be done offline, in a way. Part of the reason I only ran read-only benchmarks is that trying to use disks for measuring PostgreSQL read and write disk performance would have taken much, much more time.

Any more questions? Thank you.
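As an illustration of the automation described in that last answer, a minimal sketch of sweeping the client-count axis with several read-only pgbench runs per point might look like the following. The database name, client counts and run duration are assumptions, not the actual script that was used.

```python
# Sketch of automating read-only pgbench runs across the client-count axis,
# taking several measurements per data point as described above.
# Database name, durations and client counts are illustrative assumptions.
import re
import statistics
import subprocess

DB = "bench"
CLIENT_COUNTS = [1, 2, 4, 8, 16, 24, 32, 40, 48, 64]
RUNS_PER_POINT = 3            # three measures per data point, as in the talk
DURATION = "60"               # seconds per run

def run_once(clients: int) -> float:
    """Run one select-only pgbench and return the reported tps."""
    out = subprocess.run(
        ["pgbench", "-S", "-c", str(clients), "-j", str(clients), "-T", DURATION, DB],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"tps = ([0-9.]+)", out)
    if match is None:
        raise RuntimeError("could not find tps in pgbench output")
    return float(match.group(1))

for clients in CLIENT_COUNTS:
    tps = [run_once(clients) for _ in range(RUNS_PER_POINT)]
    print(clients, statistics.median(tps))   # one point along the X axis
```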