Okay, hi everybody. My name is Thomas Munro. I'm going to be talking about Postgres and FreeBSD together, and I've got to stay on this side of the line.

The strange title of my talk is some kind of attempt to capture what it felt like when I realized that you could actually hack the kernel and make useful changes to it. After spending a lot of time writing user space code, I realized that the kernel was actually just more C code, and you can hack that too.

So, just a little bit of background about myself. I'm pretty new to FreeBSD hacking, and Mateusz Guzik and Allan Jude are my mentors. I've spent about 20 years working on all kinds of C and C++ stuff on many different kinds of Unix, so I'm not at all new to Unix hacking, but FreeBSD is something I took up about three years ago. My main work is actually Postgres, and I work for a company called EnterpriseDB, where I work on query parallelization and transaction machinery and all kinds of stuff. So one of my personal goals is to try and make FreeBSD and Postgres a really good combination.

So, just a quick bit of background about the way that BSD and Postgres are interlinked, and their common history at Berkeley, which I think is really interesting. Michael Stonebraker, who started the Postgres project, did Ingres before that, and Ingres was a database that was developed at Berkeley around the time that BSD was being born. Just as Ken Thompson came to Berkeley with a suitcase full of tapes, I assume with AT&T Unix on them, and the whole BSD thing kicked off, these database guys were taking the ideas from IBM's System R project, relational databases and so on, which ran on mainframes, and trying to do the same kind of stuff on the new Unix systems that were suddenly everywhere at Berkeley.

I mention Ingres because it has the strange honor of being the first software that was ever released entirely under a BSD license, because for BSD itself you needed to buy an AT&T license for part of it, which
cost loads and loads of money. But Stonebraker and his collaborators were trying to get the software they developed out of the university so that they could commercialize it on the side, and they used the BSD license to do that. And then Postgres followed on from it: it was a completely new database, which also used the BSD license and ran on and was developed on BSD systems. I think that's kind of interesting.

Modern Postgres runs on all kinds of Unix. The funny thing is there aren't that many kinds of Unix left. I can think of three Unixes that I worked on for years that are just gone. It's crazy. And that was one of the reasons that I became interested in FreeBSD: I don't really want there to be one monoculture of computers remaining on earth.

Okay, so if you work on any kind of software that has to run on many different operating systems, operating systems look to you like this: there's a bunch of man pages that are essentially chiseled in stone, and that's the way the system works, and you've got to work within it. Things get much more interesting once you realize that you can actually change the operating system to make it better. And that's the case with FreeBSD, I think probably more so than other operating systems, certainly than commercial ones, and possibly more so than Linux.

Some of the dilemmas that we database hackers face boil down to variations of this general theme: should the operating system do something for us, or should we do it ourselves? So there are all kinds of scheduling problems, buffering, page caches, all that kind of stuff. Should we do our own system like that?

So I'm just going to go through a whole lot of things that could make Postgres and FreeBSD work together really well.
I think there are a lot of really small things, so I'm just going to go through a whole jumble of different small topics, things that I've run into in my three years of using FreeBSD as my main development environment.

One thing that I think is really interesting to look at is the question of buffered I/O. Postgres is basically the last major relational database to do buffered I/O. Most of the Unix operating systems developed some kind of direct I/O over the past 15 years or so, and Linus Torvalds was really against it, and this was happening around the time that DB2 and Oracle were switching to direct I/O on Linux. That quote at the top there is from the open(2) man page on Linux; I think they've changed it now. But there's this tension between the operating system people and the database people, both of whom want to have complete control of buffering, which I think is really interesting, and we'll come back to this.

So, to an operating system, Postgres looks like this, in very simple cartoon form. We use a tiny bit of System V shared memory, and a big chunk of anonymous shared memory from mmap that all the processes involved in the database cluster inherit, and we also use POSIX shared memory to run parallel queries. So I think that's quite strange to begin with: we're using all three different mechanisms for getting shared memory.

We have a cluster of processes running together in the traditional Unix form: you have some kind of parent process, which we call the postmaster, which is forking children. So this is very classic old-school Unix server software. And then we have a bunch of files on disk representing tables, or relations, and representing segments of the write-ahead log that we use to do things efficiently.

The processes involved in running Postgres have a file descriptor pool, because obviously there's a limit on how many file descriptors you can have open at once. And then they have this pipe that
connects to the postmaster, so that they can shut down quickly if they detect that the postmaster has died, if anything has gone wrong. Old versions of Postgres used to have ways of crashing that would leave all kinds of funny processes that you'd have to kill if things went wrong. So this new system keeps things tidy. So that's a very simple cartoon view of the operating system level objects involved.

One interesting thing is that we're still using processes. If you go back to 1989, which is close to the beginning of Postgres (it was originally created in 1986, so '89 is pretty close to that), they were saying that they used processes instead of threads because they were trying to get something going quickly, and back in those days, of course, threads worked rather differently on each operating system. Somehow, 30 years later, we still haven't quite got around to doing the threads.

So, from a system call point of view, what you see when you run Postgres is that idle backends sit in poll. Then take very simple queries, where all the data that they want to use is already in the Postgres page cache, sorry, buffer pool.
You really just see a bunch of network operations. And then there's what happens when we need to pull things into our own buffer pool. So we have this buffer pool of our own, and then there's the operating system page cache, so you've got this kind of two-level buffering, and you can see data moving from the operating system page cache into Postgres's buffer pool with read operations.

Now, one interesting thing is that when we do write operations, hopefully we just write sequential data into the write-ahead log. So you see a lot of writing and syncing going on in these WAL files, but not on the actual data files themselves that represent tables. This is one of the tricks that databases use to go fast. And then we have this checkpointing thing which happens from time to time, where data gets flushed out and written to disk, and that's where things get fsynced.

I mention all of this because I'm going to be talking about some of these individual points a bit later. I just wanted to give a very quick overview of how that looks to the operating system.
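To make the write-ahead pattern a bit more concrete, here is a minimal sketch in Python of appending records sequentially to a log file and forcing each one to stable storage. It is only an illustration of the idea, not how Postgres implements its WAL: the record format (a 4-byte length prefix) and the names `append_wal_record` and `read_wal_records` are made up for this example.

```python
import os

# Prefer fdatasync where the platform has it; otherwise fall back to fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def append_wal_record(fd, payload):
    """Append one length-prefixed record (hypothetical format) and sync it."""
    record = len(payload).to_bytes(4, "little") + payload
    written = 0
    while written < len(record):
        # os.write may write fewer bytes than asked, so loop until done.
        written += os.write(fd, record[written:])
    # The record only counts as durable once the sync call returns.
    _sync(fd)
    return len(record)


def read_wal_records(path):
    """Read back every complete record, as crash recovery would."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "little")
            records.append(f.read(size))
    return records
```

The point of the pattern is that the log is written sequentially and synced often, while the data files can be written back lazily and synced only at checkpoints.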
Okay, so kib, Konstantin Belousov, wrote a paper, or a descriptive sort of report, on a bunch of work that he did in 2014 when it was reported that Postgres was running really slowly on big FreeBSD systems. That's a really interesting thing to read, and it's one of the first things I read when I started using FreeBSD for my daily development.

The main story there was that in Postgres 9.3 we stopped using one gigantic System V shared memory region and started using an inherited anonymous shared mmap region, and, for some reason that took a while for people to figure out, performance just completely tanked. It turned out to be because there's this sysctl where you can tell it to use physical memory for System V shared memory, which turns off the ability to swap it out; it becomes physical memory. And for reasons that I don't personally understand very well, because they've mostly been fixed now, if you didn't use this option then you got all these PV entries, and apparently there's a lot of contention on those, so systems that had lots of cores that were very busy would contend and things would go very slowly.

Lots of improvements have been made since then to fix those problems, and I haven't quite figured out whether they've been entirely fixed or not. I tried to do some testing, and I found that it was still possible to see a difference if you tell Postgres to use the old System V shared memory region. You can see that there's a light blue line there, which corresponds to using System V shared memory and using this sysctl to tell it to use physical memory for that, so it can't be paged out. This is on a 40-CPU Amazon image, an m4.10xlarge, and you can see that I could get a little bit more, what is that, like 5% more performance out of the system?
I think that's kind of interesting. I need to do a bit more digging and find out if that was a real phenomenon or not. For that reason we're actually going to put the ability to use System V shared memory back into Postgres. We took it out at 9.3, and I think we're going to put it back in quite soon, because it seems that there's something still here. We should probably dig into FreeBSD as well and find out if that can be fixed.

Okay, so the first thing that I tried to do, when I was looking for things to do on FreeBSD that would make Postgres work better: there was no fdatasync. I tried to get this working, and I had something almost working for UFS. fdatasync means flush the written data but not the file metadata, so things like modified time and so on. If you're writing a whole bunch of stuff to the write-ahead log and calling fdatasync, it requires one random I/O instead of two, because you don't have to go and change the modified time and so on. I had something kind of working, but one day I updated my source tree and Konstantin Belousov and others had suddenly implemented that, and there it was, and it was working. That was probably a little bit too hard to start with anyway; modifying the file system is probably not a good first project for learning kernel hacking. But that's something that came along in FreeBSD 11 and made Postgres faster on UFS at committing transactions, which was good.

So then the next thing I tried to improve was setproctitle. One of the things that kib wrote about in that report was this: if you look at Postgres in top, or any tool like htop that shows the process titles, you can see the words UPDATE and SELECT there; you can see what each process is doing. It's got these little tags that it shows there. But if you're doing loads and loads of super fast transactions, it turns out we
call that possibly tens of thousands or hundreds of thousands of times per second, or even a million times per second on some systems. And it turned out that FreeBSD was the only system that was using system calls to do that. Every other operating system that I looked at, including the other BSDs and Linux, just had a piece of memory, and you could update the process title just by writing into that memory.

So I thought, well, that looks like something I could handle. I dug into that a bit and found that the code to do it that way was actually still there in FreeBSD. At some point somebody had added the system call type interface, which has some advantages and some disadvantages, but it didn't work so well when you call it a hundred thousand or a million times a second. So I introduced a new mode that would do it the old way and just update some piece of memory. That has some different problems: in theory you could see a torn version, half of one string and half of the other string. That never really happens; you don't normally write messages that are bigger than a cache line anyway.

kib convinced me to give it a different name, so you've got setproctitle_fast. And the result of that was that on a 40-core machine I could get 10% more transactions per second, which I was pretty pleased with for a very small patch, and that was not the first thing I got into FreeBSD. I think there's a question of whether we should just only have that kind of setproctitle. Do we really need the other one? I don't know, but I'm new here, so I'm not going to argue that.

Oh, well, I think it's nice to be able to see what the system is doing. You don't need to, and you can turn that off, and in fact the Postgres port turns it off by default. But I think that makes the experience not quite as nice as using it on other operating systems, so that's why I wanted to fix it.
It's a very nice, simple way to see what the system is doing. I like it, so I don't like to turn it off.

Okay, so the next thing I worked on was this. On IRIX, and they were the first to do this, there's a way to ask for a signal when your parent dies. Of course, on any Unix system you can normally get a signal when a child dies, but in traditional Unix systems there was never a signal when your parent dies. I don't know why; it seems incredibly useful to me. But the IRIX people did that, and then the Linux guys copied it. So I didn't just copy this from Linux, I copied it from IRIX. That's what I'm trying to say.

This was useful because, as I mentioned, we have this pipe between every child process and the postmaster, the parent of all these processes, because we want the whole system to shut down cleanly if something bad happens to the parent process. That means that any time we're doing any waiting, with poll or select or whatever, we have that pipe there. But it also means that if we're doing some kind of busy loop, and this happens mostly when you've got a streaming replica server, which is just slavishly following along, applying everything that's happening on the master server, that can be a really busy loop, and if you're never waiting, there's no efficient way to find out if the parent has died.

So we finished up with this thing where we try to read one byte from this pipe from the parent process every time through a busy loop, which turns out to really suck, performance-wise. Somebody figured out that some streaming replica servers were spending up to 20% of their time doing that, which is completely insane. So on Linux, first of all, we used PR_SET_PDEATHSIG to get a signal instead. You just don't need to do that if you can get a signal instead.
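The pipe trick described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, and the example simulates the postmaster dying by closing the write end in the same process rather than forking a real parent and child.

```python
import os


def make_postmaster_pipe():
    """Create the pipe; the parent keeps write_fd open and never writes."""
    read_fd, write_fd = os.pipe()
    # Non-blocking, so a liveness check never stalls the busy loop.
    os.set_blocking(read_fd, False)
    return read_fd, write_fd


def postmaster_is_alive(read_fd):
    """Try to read one byte: EOF means every write end has been closed."""
    try:
        data = os.read(read_fd, 1)
    except BlockingIOError:
        # Nothing to read, but a writer still exists: parent is alive.
        return True
    # An empty result is EOF, which is what a dead parent looks like.
    return data != b""
```

Every call to `postmaster_is_alive` costs a system call, which is why doing it on each iteration of a busy loop adds up, and why asking the kernel for a signal on parent death avoids the overhead.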
You just never need to poll this pipe all the time. So that seemed like a reasonable thing to try to do in FreeBSD, and it was relatively simple. It's just a matter of recording in the proc struct that you would like a signal. Whenever a process exits, you always have to scan through all the children anyway to reparent them, right? So there was already a loop there, and I just had to make it deliver a signal if you'd asked for one. That was a relatively simple piece of code, and it made recovery and replication measurably faster. Cool. And that's in FreeBSD 12.

Another thing that was fixed recently, not by me, was a long-standing problem in FreeBSD jails: System V shared memory wasn't jailed properly, which I found kind of scary and strange. It was like that for years, apparently, until FreeBSD 11. If you were running multiple Postgres servers in different jails, you had to do some strange tricks: you had to make sure they were using different ports or different users, because the user ID and the port appeared in shared memory, and they could see each other's shared memory, which is not what jails are supposed to do, right? So that was fixed, and that was a good thing, and it probably has good security implications for other software as well, but I don't know.

Another improvement: in a recent version of Postgres we switched to using epoll on Linux, and that should imply that we should be using kqueue on FreeBSD and so on. So I wrote a patch to get Postgres to use kqueue and kevent to wait for things to happen. Quite often we're waiting on three fds: we're waiting for the client to send us a query, we're also waiting on that pipe that tells us if the parent process has died, and we also have a self-pipe that we need in order to deal with signals coming in. So we're quite often waiting for three file descriptors, and if you're doing that all the time
with poll, you need to acquire locks on all those objects in the kernel, so it stands to reason that if you had your own kqueue file descriptor it should be much more efficient, right? And I found that it was much faster in some cases, but in some cases it was slower. Furthermore, NetBSD users reported that it was much slower, and I don't know anything about NetBSD, so I didn't try to analyze that myself. Then I thought, well, if we have this patch in Postgres that uses kqueue, we could disable it for NetBSD until that can be fixed, and just use it on FreeBSD, or something like that. It also goes a lot faster on Macintoshes, not that anyone runs database servers on Macintoshes, but it was interesting to have another implementation of kqueue and see that it went better. But for some reason some workloads became slower on FreeBSD, and we still need to figure out exactly why. It seems to be something to do with the wake-up logic and the scheduler. We just need to get to the bottom of that, because in most cases it makes things go better. So that's something that we haven't managed to get in there yet.

Okay, another thing which would be good to have in FreeBSD is sync_file_range. That's a system call on Linux which is sort of a hint. It's not like fsync exactly; it's more like, please start writing back this range of the file. Well, it's got a few different modes, actually. But the basic feature that we want is more fine-grained control over fsyncing, because we actually have a pretty good idea of when our next checkpoint is coming up, and we have better information than the kernel does about when to write data back. So we'd like to tell it to start syncing ranges of a file that contain dirty data. That's a feature that's used by loads of different databases, so it's not just us. It would probably be quite a good idea to have that.
There is a PR for it. It seems like it should be easy, but I'm still slightly frightened of hacking on file system code.

Something that happened last year in the Postgres community was what we jokingly referred to as fsyncgate. Somebody figured out that a very popular operating system that isn't FreeBSD had strange semantics for fsync, in particular around error handling, that as far as I know almost nobody knew about. I mean, obviously Linux kernel hackers knew about this. But the most surprising thing was that if you call fsync and it says EIO, there was some kind of error in the I/O layer, it doesn't keep your dirty buffers around anymore. It just throws them away, which is kind of strange, because that seems to be its job, right, to not do that. That has the strange consequence that if you call fsync again on that operating system, it can say OK, because now there are no dirty buffers, because last time it threw them away.

And it's not just Postgres; MySQL and MongoDB and a bunch of other software that was trying to be reliable would also be prepared to retry. The way that Postgres works, it has these periodic checkpoints, and the checkpoint would fail and it would log lots of nasty error messages, but a bit later the next checkpoint would come around, maybe after a few seconds, and it would try again, and then it would say success. So if you're trying to shut your database down... When a checkpoint succeeds, because it believes it managed to fsync all the data, it then throws away the WAL, the log that it could have used to rescue you. It has now thrown away all copies of the data. This is a complete disaster, right?

So one question is: are there any cases where there's an I/O error but then it fixes itself? Does retrying actually make sense at all?
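The retry hazard described above is easier to see in a toy model. The Python sketch below is not kernel code and not any real API: `ToyPageCache` and `checkpoint_with_retry` are invented for illustration, and the flag `drop_dirty_on_error` stands in for the two behaviours being contrasted (throwing dirty buffers away after a failed write-back versus keeping them).

```python
import errno


class ToyPageCache:
    """Toy model of a kernel page cache for a single file."""

    def __init__(self, drop_dirty_on_error):
        self.drop_dirty_on_error = drop_dirty_on_error
        self.dirty = []      # written by the application, not yet on disk
        self.on_disk = []    # what would survive a crash
        self.fail_next_writeback = False

    def write(self, data):
        self.dirty.append(data)

    def fsync(self):
        if self.fail_next_writeback:
            self.fail_next_writeback = False
            if self.drop_dirty_on_error:
                # The surprising behaviour: the dirty data is forgotten.
                self.dirty.clear()
            raise OSError(errno.EIO, "simulated write-back failure")
        self.on_disk.extend(self.dirty)
        self.dirty.clear()


def checkpoint_with_retry(cache):
    """Naive checkpoint: if fsync fails once, simply try again."""
    try:
        cache.fsync()
    except OSError:
        cache.fsync()
    # Reported as success; the caller would now discard its log.
    return True
```

With `drop_dirty_on_error=True` the retry reports success even though nothing reached the disk; with `False` the same retry really does write the data, which is the difference the rest of this discussion turns on.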
That's kind of an open question. Probably for a long time there weren't such cases. When your disk fried, at least back in the old days, it probably didn't fix itself again; that might have been unlikely, I don't know. But now, in modern times, we've got all kinds of thin provisioning and remote storage systems, and there are transient failures. So this is a real problem.

So we did the only thing we could really do, which is that if you see any error from fsync, you just have to panic and completely shut down, because the whole world is insane after that point. We have no way to know what happened. If people start throwing away dirty data from the kernel buffers, it's kind of game over, right? So that's what we did, and I know that MySQL and MongoDB made the same choice: they now panic if there's an I/O error.

Actually, I put a setting into Postgres so you can turn that off, because FreeBSD doesn't have this problem. There's a commit from 1999 where you can see this problem was discussed, and FreeBSD doesn't throw away pages in the page cache, the kernel's buffers, if it couldn't write them back. Which makes intuitive sense: the data is still dirty until it's been written to the disk, right? I should add that none of this applies to ZFS, which has its own completely independent caching and so forth, and it never throws away dirty data that hasn't been written to disk. It has an option to panic or retry and so on.

So that was an interesting case where FreeBSD did very well and came out looking quite good, I think. It's also interesting to note that that choice to throw away dirty data buffers when you couldn't write them back was in ancient Unix, so it was probably sort of inherited by everybody without much analysis,
I think, until FreeBSD changed its mind on how that should work.

Okay, so before FreeBSD 11 there was no support for collating Unicode in FreeBSD. I think if you tried to do it, it would be collated in binary order or something like that. But in FreeBSD 11 there was a new collation implementation, which is shared with... I think the code came from maybe illumos, or maybe DragonFly BSD, one or the other, but somehow the people working on those three projects worked together. Is anyone here involved in that? I don't want to be preaching to people who know vastly more about it than me.

So as of FreeBSD 11, you can actually do collations properly with Unicode, by which I mean that the strcoll function in libc can compare two strings of Unicode and compare them correctly, using the right rules for a given locale, for a certain language or something. And Postgres uses that a lot. We're really big on using collations, because people use text in indexes, and indexes, at least B-tree indexes, are entirely based on comparisons, right?

Now, because FreeBSD didn't support collating Unicode before that, there was actually a patch in the port of Postgres for FreeBSD which made it use ICU instead of libc, which was really interesting. Postgres itself is now slowly gaining ICU support as an option, and I'll come back to that in just a moment.

One of the problems we have with Postgres, one of the ways that we get corruption, is when the collation rules change. If you build a B-tree, which is an index, it's entirely dependent on the rules not changing, on the sort order of the keys not changing. But of course they do change. They change because there were mistakes, and they change for all sorts of reasons.
I mean, there was a case about two years ago where the glibc guys rolled out a new version that changed the order of a lot of German words, and all kinds of databases in Germany using the German locale had corrupted indexes. What that meant is that their queries started giving strange answers because they couldn't find keys, because something that should be over here was now over here in the tree, or whatever.

It's a complicated business, and it's not exactly clear what you should do about it. What a lot of commercial databases do about this is just not use the operating system for that. They have their own collation system. The equivalent for us would be... well, we're not going to write our own one, it's too hard, but we might use ICU from IBM to do this. But with my FreeBSD hat on, I don't want every application to make its own solution to collating text, because then they'll disagree with each other, which is not good. I also don't want to just give up on the problem because it's too hard. That's stupid, right? libc provides strcoll, and it should work in a usable way.

So what I think we should do is provide a libc interface to ask for a version string for the collation definition, and guarantee that we will change that version string whenever the definition changes, so you can compare it, and if it's not equal then you know that the rules have changed in some way. Then at least Postgres could say, hey, this index is now corrupted, I can't use it because the collations have changed. That's something I'm proposing right now for FreeBSD 13. Ideally I'd like to propose that to POSIX, actually, and doing that requires having an implementation, so getting something like that into libc for FreeBSD would be a start towards trying to convince others to accept it as well. Very simple: just a version string on collations.
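As a rough sketch of how a database could use such a version string, here is a small Python example. Everything in it is hypothetical: the `get_version` callable stands in for the proposed libc interface (which does not exist under any name used here), and the version values in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class TextIndex:
    name: str
    collation: str
    # Version string recorded at the moment the index was built.
    collation_version: str


def build_index(name, collation, get_version):
    """Build an index and remember which collation version sorted it."""
    return TextIndex(name, collation, get_version(collation))


def index_is_trustworthy(index, get_version):
    """The index can only be trusted if the rules it was built with
    are still the rules in force now."""
    return get_version(index.collation) == index.collation_version
```

For example, with a dictionary such as `{"de_DE.UTF-8": "34.0"}` playing the role of the operating system, an index built under "34.0" stops being trustworthy as soon as the dictionary entry is changed to "35.0", at which point the database could refuse to use the index or ask for it to be rebuilt.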
That's all I really want, to solve that problem. You could go much further, and you could make it so that you could access the older versions and so on. You can ask glibc for its version on Linux, but that might be a little bit too pessimistic: every time libc changes, the database would have to assume that every index is now corrupted, which would be logically valid but quite a hassle. So I'd like to get something that's quite fine-grained.

Yeah, okay. So there are all kinds of ways that could go wrong, and that's why I'd like to make a formalized way to make it work. The problem with this scheme, though... well, I've got a patch that I proposed on the review site, where the Unicode data that's used behind the collations has a version number, so I just exposed that. The problem is that if there's a bug in the C code that reads that stuff, and we change the C code, it might change the behavior as well, and that wouldn't be captured by that. So you kind of need a combination of the library version and the data version. I'm not too sure. That's something to consider.

Okay, another problem is that the current strcoll implementation always expands the strings to a wide character format, which involves mallocing memory, expanding the whole string, comparing them, and then freeing them, which seems to be quite wasteful. It seems like it'd be quite nice to be able to do it just using the UTF-8, without having to decode the whole string. For example, you might know the answer after comparing only the first two characters. So if you don't have to do any malloc or free and you can exit early, then you can save a lot of time. I know that this implementation of strcoll is brand new.
So I shouldn't be beating it up. It's really cool that we have that now. But this is something that probably not much software hammers as much as Postgres does, because any time you have an index on text, we're going to be using strcoll like that.

Another thing is that we have text that's not null-terminated, and strcoll, like all the libc functions, works with null-terminated strings. That's a couple of problems. Number one, U+0000 is actually a valid code point in Unicode, but we just can't support it in Postgres. So that's one kind of Unicode string we can't support, which people complain about every now and then. It'd be nice to fix that, but we probably can't. On the other hand, in some places we have to copy strings that we have on disk just to put a null on the end so that we can call strcoll, which kind of sucks. So now we're copying the string, and then we call strcoll, and FreeBSD's strcoll is then copying the string again, and then freeing it, and then we free ours. It's crazy. We need to get rid of all that stuff.

So one idea I had is to have a version of strcoll which has an n in the middle, and you can give it the length, and it will stop looking at bytes, or characters, after that point. I don't know whether it would be bytes or characters.
I'm not sure. And actually, when I thought of that name, I tried googling it. I just thought, well, there could be a strncoll, and I googled that and found that Microsoft's libc has it, and I thought, well, that's kind of funny. So then I did some more googling and found that it was actually in the proposal for C99, and then it was ripped out, with the rationale that it's not clear what it means: if you only compare half of the strings, that wouldn't be consistent with the longer strings' order, because sometimes the rules for comparisons are hairy like that. So, I don't know. Maybe there's some other kind of function you could use to not have to null-terminate things.

All right, and finally there's a thing called string transform, or as it's written, strxfrm, which gives you a short binary string. It mangles your input string in such a way that the result can be compared as binary, bitwise, with a strxfrm'd string from some other string, and it's guaranteed to give the same result as strcoll would on the original strings. That's useful because you can take any size prefix of those, and there are some situations where you're going to do many comparisons of the same string, so if you can convert it to this form first you can then do integer comparisons on the first 32 bits or something like that. So that enables a bunch of optimizations. However, we found when we tried to actually do this that strxfrm doesn't meet the guarantee in pretty much any libc. They're all broken, at least... I mean, and that leads to corruption. So we had to turn that off, which was disappointing, because it went really fast. So if we could get a reliable strxfrm, that would be cool.

Okay, some more ideas. On AIX they have this wonderfully named signal, SIGDANGER. I love it.
You receive SIGDANGER. AIX, I think, was the first operating system to have overcommit. I'm not sure if it was; at least it was an early operating system to have overcommit. So for the first time, your program can be killed because there's not enough memory to, you know, cash all the checks that have been written, in some sense. And before it starts killing people, it will signal them; it'll signal everybody. That gives you a small amount of time to do something about it, and Postgres, for example, could do something if it gets SIGDANGER. It could throw away some cached data, for example, and then the system could survive.

Now, it's an interesting question whether you want the system to survive or to die, because if you're getting SIGDANGER on a regular basis, it's probably not really sustainable; something's probably wrong and you probably need to fix something. So I'm not really quite sure whether it's a good idea or a bad idea to implement SIGDANGER on FreeBSD in order to be able to use that. But I do know that they have the same kind of concept on iOS, on Apple phones. It's not exactly a signal, but something like that, where programs can clean up their caches and so on to reduce memory. And I know that the people at Facebook, who do all their stuff on lots of Linux servers, have a user-space out-of-memory killer instead of the usual one, and they have that just so that it can tell you when it's going to kill you and you can avoid death. Should FreeBSD have a thing like that? I don't know. I think it would be interesting. It wouldn't be very hard to implement, but would it enable anything useful?
Would it actually be good for a database to do that, for example? I'm not quite sure.

Moving on to the ports. On most Linux distributions, the packages for Postgres allow parallel installation of different major versions. The major versioning scheme in Postgres corresponds to different disk formats for the data, so changing from one major version to another is a big deal, because it means you have to go through a step that will modify the data on disk. So firstly, you might want to have two major versions running at the same time, for the same reason that you can install Perl 5 and Perl 6: there are certain classes of software where it's useful to have major versions installed at the same time. So we should fix that in the port so you can do that.

And then another thing, and this is a matter of taste: in my personal opinion, the Debian maintainers have got, I think, the best approach to actually managing Postgres instances running on a server. They have a package called postgresql-common. It was written by the Debian maintainers, but it's now been taken up by various other distributions, including non-Linux things, and I think maybe we should become one of those non-Linux things that takes it up too. It lets you do it.
You know the way iocage and things like that have a nice way to list jails or whatever? It just lets you list them: you've got commands like pg_lsclusters, "cluster" meaning an instance of Postgres. You can see them running, and you can start them and stop them and create them and so on. I think if we took that code, and allowed you to install multiple major versions at the same time, and had this thing that could deal with them, so you can create clusters of different versions, and then we made it even better by adding ZFS knowledge and jail knowledge to it, so that you could create clusters in jails, and you could snapshot databases and back them up and all kinds of stuff like that using ZFS magic or integration, I think that would be really cool. That would kind of take it to the point where, you know the way people take up FreeBSD because they want to get ZFS? If we could make it that good, and take advantage of the ZFS magic and so on with really nice tooling, it might be something that attracts people to FreeBSD because they want to run a database on it.

Only a couple of minutes left, so I'm gonna go faster. One thing that I ran into when we use POSIX shared memory: I discovered that on FreeBSD there's no way to see which POSIX shared memory segments exist. There's no way to list them, which is kind of weird. It's like someone never got around to adding the facility. I don't know, maybe ipcs or some new command would show you the shared memory segments that exist. On Linux, for example, their shm_open just uses actual files on a /dev/shm tmpfs file system or something, but on FreeBSD it's in a hash table in the kernel that you cannot see. We should really fix that, because it makes it quite difficult to develop with these things. You also can't see, with procstat -v, the names of any segments that are mapped in that way.
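For readers unfamiliar with the API in question, here is a minimal sketch of how a program creates and maps a named POSIX shared memory segment. It is an illustration only, not Postgres code: the helper name `create_named_segment` is hypothetical, while `shm_open`, `ftruncate`, `mmap` and `shm_unlink` are the standard POSIX calls.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a new named POSIX shared memory segment of
 * the given size and map it read/write.  Fails if the name already exists.
 * Returns the mapping, or NULL on failure.
 *
 * The name lives in a system-wide namespace until shm_unlink() is called.
 * Where that namespace is kept (visible files or a kernel-internal table)
 * is up to the operating system, which is the visibility problem the talk
 * describes.
 */
void *
create_named_segment(const char *name, size_t size)
{
    void *addr;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd < 0)
        return NULL;

    /* A new segment has length zero; give it its real size. */
    if (ftruncate(fd, (off_t) size) < 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);      /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        shm_unlink(name);
        return NULL;
    }
    return addr;
}
```

A caller would typically pass a name such as "/myapp_segment" (an example name, not one Postgres uses), and later release the segment with `munmap` and `shm_unlink`. If the process crashes before unlinking, the segment is left behind, which is when a way to list existing segments becomes important.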
That's something we should fix.

A couple of other development hurdles I ran into: Valgrind doesn't work for Postgres. I'm not quite sure what the status of Valgrind is in the port. It aborts; it does not like running Postgres, and that's something that I would really like to... it's probably quite complicated to fix. I don't really know, but yeah. Another problem that I ran into is that DTrace ustack() was just not working for a few months, but I discovered a couple of weeks ago that that's been fixed, so I crossed that off.

Okay. CloudABI is a really interesting development for running software in a secure way. I think it will be possible to make Postgres run in CloudABI. It's an alternative to jails and containers. It could let you run a database trapped entirely inside one directory, so that it can't access files outside that; everything's based on a file descriptor, which is the root of everything it can access, using capabilities. I think that would be quite an interesting thing to investigate for FreeBSD and Postgres.

Since I only have 12 seconds left,
I'm just gonna say: my talk is too long for the session. Yeah, I'm gonna have to... yeah, well, I mean, the recording stops, so I'm gonna... Okay, I'll get a couple of minutes.

Okay, so a couple of other problems that I ran into that are things we could fix. One of them is this read-ahead and write-behind thing. Quite often in Postgres, if you modify a lot of rows in a table, it's reading them at one end and writing them at the other end as they fall out of its buffer, which means that you've got these two heads moving sequentially. But UFS and VFS have this single way of detecting sequential access, and it doesn't work, because it's alternating between reading and writing in two different places. That's something that actually makes things go a lot slower on FreeBSD than on Linux. So we could probably fix that by increasing the smarts of the heuristics used for sequential access detection. I should say that is not a problem on ZFS, because ZFS has its own completely separate read-ahead and write-behind. Probably most people using Postgres on FreeBSD are doing it on ZFS, but certainly some use UFS as well. That's another thing we should fix. I'm gonna skip that, and I'm gonna say thank you very much for coming to my talk.