The last talk for today, and unfortunately that means we are coming towards the end of FOSDEM. How many of you have been to FOSDEM PGDay, which happened on Friday? Awesome. If you haven't heard of it before: every year, just before FOSDEM, we organize PGDay, one day full of Postgres, in case this is not enough for you. Just visit the FOSDEM PGDay website for more details. We will be here again next year. So the final talk of today is from Ilya, one of the most famous speakers in our community. He is going to speak about the latest evolution of the Linux I/O stack, explained for database people, not kernel hackers. Thank you, Debra.

Hello everyone. This is the last talk; my congratulations, you pretty much survived, which is probably a good sign. The downside is that the last talk after an almost three-day marathon is about boring Linux things, but I tried to make it not that boring. The main problem with such talks is that you need to cover too many things: you need to explain how databases work, how Postgres works for example, then how Linux works, and then how it all comes together. I tried to solve this not-easy task. Let's go.

So why this talk? First of all, Linux is quite important for databases. I would say it is today the default operating system for database use. For a long time it was not like that: Solaris, HP-UX, sometimes Windows; many operating systems were used for databases. Now, in the open-source world, even commercial database companies invest in Linux to achieve better performance for their databases, and PostgreSQL (regrettably, maybe, for the FreeBSD and OpenBSD projects) is now largely aimed at performing well on Linux. Not because those operating systems are worse, but because Linux is where people want to run Postgres. And the central thing for a database workload is fast I/O, because, you know, sometimes we run into CPU problems,
sometimes we run into this and that, but with intensive workloads, I/O is most likely the common problem we need to fight with. Another problem, when you try to figure out what to do with Linux and its I/O stack to improve your PostgreSQL or MySQL performance, is that you can find some posts on the internet and some emails in mailing lists like pgsql-hackers or the kernel mailing lists, but all the information there is written by developers, and the language they use requires a lot of knowledge; it is written basically for kernel developers, not for database people. That is the task I tried to solve: to give you some clue about which direction to go, what to google, where to dig for information. Yet another problem is that the Linux I/O stack has recently been redeveloped quite intensively. For a long time there were really lots of problems inside the Linux I/O stack, and various other problems which kernel developers needed to solve first, so for a long time nothing happened; and then, during the development cycles of version 3, version 4 and now version 5, everything changed and whole subsystems were overhauled.
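(A practical aside that was not part of the talk: before tuning anything, it helps to confirm that I/O rather than CPU is what you are fighting. A minimal sketch using standard tools; `iostat` comes from the sysstat package and may not be installed everywhere, and device names vary per system.)

```shell
# Per-device latency and utilization, three samples two seconds apart:
# a high await together with high %util points at the disks.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 2 3
else
    echo "iostat not installed (sysstat package)"
fi

# CPU-side view: a large 'wa' (iowait) column tells the same story.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 2 3
else
    echo "vmstat not installed (procps package)"
fi
```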
So, what are we talking about today? A very brief introduction, especially for the first row, on how Postgres works with the disk; then a slightly historical part about what Linux I/O was like in the good old times of old kernels; and then a brief introduction to what is new. You will probably not have the complete picture after this talk, but at least you can note down the keywords and know where to find the information afterwards.

A very typical database (and from these two things you will guess that this typical database is typically Postgres) works with kernel space, with kernel memory, the page cache or kernel buffers, and it works in user space with its own shared memory segment. Linux takes care of one part, the database code takes care of the other, and this page model has been here, basically unchanged, for many years. It is very convenient: when you read some information from disk, you just put the whole page in memory, and if you do not change anything, it is quite easy. Just like a modern in-memory database, you can read the information and return it; that is a SELECT. But when it comes to updates, you mark some of the pages as dirty, and then your I/O problems usually start. If you update something, you need to write the write-ahead log, so at some point you have the information about your recent updates in the WAL, and the memory snapshot is inconsistent compared to the on-disk image. So from time to time you need to issue the nightmare of any DBA, especially ten years ago: the checkpoint. All the dirty pages go down to the disk, and that is the I/O problem.
You usually hit it there. So basically, if you have a read-only workload, that is not that bad; when you have writes, that can be a problem for you. I am talking about databases specifically: we could also talk about full-text search or about file servers, but a database workload has its own very specific key features. Shared memory today can be sized by the price of RAM. When I started giving talks about how to tune Linux for Postgres, I used to say that 32 gigabytes of memory is now cheap; some people laughed, because it was quite expensive at that time. Now I would say that even one terabyte of memory is not that expensive. So basically you can have a lot of memory on your database server, you need specific settings for it, and both Linux and your favorite database (all of them, I hope) should adapt to that. It can be really a lot of data, and when you synchronize those pages, this huge amount of data travels down and up the stack. Here are the I/O problems. Besides that, there are different I/O issues which may be less dramatic than checkpoint spikes but can still be troublesome: the write-ahead log should be written efficiently, copy-on-write file systems sometimes have limitations, and pretty much every point of this stack should be optimized for exactly this type of workload, where we push loads of dirty pages down to the disks. What generates most of the I/O problems in Postgres, as I said, is this page synchronization when we need to write a lot. Besides that, autovacuum can sometimes be troublesome, but that depends on exactly how your workload behaves. Sometimes cache refill can be bad, but today many of you,
I am sure, are using SSDs, which makes that much better. And sometimes you can have lots of I/O problems with normal Postgres workers, but this is out of scope for this talk. Basically I can say that when your ordinary PostgreSQL backends perform such I/O operations themselves, that is generally bad and you need to avoid it. Those processes are not designed to make checkpoints, for example; if they are trying to dump dirty pages to the disk, it means something is quite wrong in your database setup. That is an emergency measure, not a normal process.

For a long time, the huge problem for databases was how to maximize throughput. This word, throughput, was quite common in all talks about database performance in terms of I/O. When you talk about throughput from user space, from the database down to the disks, every part of this stack can be involved. We were mostly talking about throughput because one part of the stack was quite vulnerable: the disk was slow. Because the disk worked quite slowly, a lot of the effort and time of developers, of the Linux kernel and of Postgres, was concentrated on maximizing throughput in the rest of the stack. The latency of the disk itself was high (we needed to move the heads on the disk, and so on) and could not be improved, so most people were looking at throughput, not at latency. But technology does not stand in one place, evolution goes on, and the situation has changed. Why do I concentrate on throughput and latency? When you have such a complicated stack, it is sometimes easier to maximize throughput, for example by using parallelism where we can. A typical example (not parallelism strictly speaking, but a helper process) is that when the PostgreSQL checkpointer could not manage the checkpoints, the dumping of all those dirty pages, PostgreSQL people came up with the background writer, so that between checkpoints
we can use the background writer to help the checkpointer. That is a typical example of maximizing throughput, because we could not minimize latency any more; minimizing latency is quite tricky. But now the situation has changed, and we have SSDs, which probably do not make the DBA job obsolete, but they do reduce the latency component in the I/O stack, and the whole system needs to be adapted to this modern situation.

Because of the high latency of rotating disks (this is the historical part) there was a lot of effort to improve the performance of this stack in terms of throughput, and you probably all know those recipes: you need to tune vm.dirty_ratio and things like that. Those were the times of rotating disks, of battery-backed cache and so on. Tuning those parameters is the DBA's task, but inside the Linux kernel there are lots of internal optimizations too, and that is the problem: we are using SSDs, but the methods inside the Linux kernel were designed to move the heads of a rotating disk efficiently. An SSD basically has no heads, and it is a much more parallelism-aware device, even compared to a disk array.

So this is the I/O stack. We can use direct I/O or we can use the page cache; in Postgres we use the page cache, but we need to go through this whole stack. About file systems there is probably nothing very interesting here, although there are lots of benchmarks and several approaches to improving file system performance, for example not using the write barrier on ext4. Let us look deeper into the kernel instead. In the kernel there is the so-called bio layer, the block input/output layer. The task of this part of the kernel is to form I/O requests: this is basically the struct bio, and we take blocks, much like in database practice, and form a vector of those blocks to finally put to the disk. The interesting thing here is that we operate on pages inside our database and in user space, while on disk, on the old disks, we were
operating with cylinders, sectors, and those rotating things: a heritage which was quite natural when people were programming the sd driver, SCSI, and so on. This part of the Linux I/O stack was specially designed to make a smooth and efficient transition from pages to cylinders, sectors, and those old-school rotating-disk concepts. Then we form the request which goes further down and instructs the driver: put this piece of data there, and so on. We can optimize the input and output in some ways, and for a long time all this optimization was about where we store the data. If we can dump the data to the disk using a single movement of the disk head, that is efficient, so all the optimization here was about merging and sorting these vectors of blocks to fit the disk layout, for example when blocks can be stored together. That is where the idea emerged of having an elevator, or I/O scheduler, which helps with sorting all those pages and putting them to the disk in the most efficient manner. Before kernel 2.6, quite a long time ago, there was just a single elevator, the so-called Linus elevator. Like many things which Linus made himself, it was considered a perfect thing, very simple and working very well, but in effect nobody really checked whether it worked well.
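(An aside not from the talk: the kernel records its guess about whether a device rotates, and merge-and-sort heuristics like the ones described above depend on that flag. A small sketch; the device name `sda` is an assumption.)

```shell
# 1 means the kernel treats the device as a spinning disk, 0 as
# non-rotational (SSD); adjust the device name for your system.
dev=sda
if [ -r "/sys/block/$dev/queue/rotational" ]; then
    cat "/sys/block/$dev/queue/rotational"
else
    echo "no block device named $dev on this machine"
fi
```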
It does something, but whether it does it efficiently, nobody knows. We have disk latency, and optimizing disk latency is difficult, a physical limitation, so we just used something. And this type of elevator had lots of problems, mostly because it did not fit the job even for rotating disks, and for modern disks it simply would not work: starvation, for example, when you write some information and then need to read from another part of the disk, et cetera, et cetera. So, lots of problems. Between kernel 2.6 and the early 3.x versions, people came up with different schedulers, but the logic inside those schedulers was still pretty much a set of improvements over the old Linus elevator. The first of them, more universal and more typical, was Completely Fair Queuing (CFQ). The idea behind this type of elevator is that you have processes, and for each process you provide an I/O queue, which basically provides fair access to I/O. But when we have Postgres, just imagine how efficient that would be. You have some connection to the database, and it gets some percentage of the I/O; an autovacuum worker process basically gets the same access to the I/O; the checkpointer gets the same access to the I/O. This is pretty much not efficient, because the checkpointer needs a lot of I/O, while a normal Postgres backend which performs one SELECT probably needs no I/O at all: maybe it just takes the data from the cache, or just handles an idle DBA connection; we do not know. So for a specific workload like a database, this scheduler was never that universal, never that good. Maybe for a desktop environment
it was better. Then the deadline scheduler emerged, which was a sort of improvement of the idea, an attempt to improve the situation. Basically it has two types of queues, a queue for reading and a queue for writing, and it just starts to read or to write; but all the requests are tagged with a timestamp, and that timestamp allows the kernel to figure out whether a request has timed out or not. If it has timed out (say we have a huge amount of information to write, it takes a long time, and we hit the timeout), then the kernel raises the priority of that queue. So basically, if we have a lot of pending I/O, it eventually rises in priority and is worked off more efficiently. For rotating disks with normal RAID controllers it was much better than CFQ, but still not perfect, and especially not perfect for SSDs. Then some people came up with the idea of the noop, or none, scheduler for devices which allow much more parallel execution of I/O, like disk arrays and SSDs; the main difference of such devices is that they allow much more parallelism. The idea was that this scheduler does not change anything: it does not perform merging, sorting, things like that. It just does nothing.
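(A hedged illustration, not from the talk: the elevator is selected per device through sysfs, so you can check which of the schedulers just mentioned a disk uses and switch it at runtime. The device name `sda` and the list of available schedulers are assumptions that vary per system and kernel.)

```shell
# The active scheduler is shown in brackets, e.g. "noop [deadline] cfq".
dev=sda
if [ -r "/sys/block/$dev/queue/scheduler" ]; then
    cat "/sys/block/$dev/queue/scheduler"
    # Switching (as root) lasts until reboot:
    #   echo noop > /sys/block/$dev/queue/scheduler
else
    echo "no block device named $dev on this machine"
fi
```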
So it is like a placeholder for a scheduler, and it turned out to be much more efficient for highly parallel storage compared to the other two. Probably at that point, or slightly earlier, many people started to work on real improvements of the I/O stack, and here we hit some terminology problems. As you remember, on that diagram of the stack there was an I/O scheduler, which was a part of the kernel working with the request layer. There were lots of discussions about whether to improve this, and finally the Linux developers came up with the idea that they needed to substitute practically all of it: not only replace the whole request layer and add some elevator, but also adapt the device drivers to work with a new analog of the request layer, and probably change many things along the way. After this new approach emerged, we practically have a new I/O stack. Part of it is the NVMe (non-volatile memory) driver, and part of it is the so-called blk-mq, which is the scheduler layer and the request layer simultaneously, and which now has some new schedulers inside. Following the effectiveness of that noop idea, the goal was an I/O scheduling layer designed from the start to support lots of parallelism. This was first introduced in kernel 3.13, and I think the last major piece was merged into kernel 4.10 together with NVMe. Before that it was not that efficient, but you could probably run into it and use it together with late 3.x kernels if you used NVMe, because basically in late 3.x kernels
there was NVMe already, in spite of what you might see in your configuration. The idea is that SCSI is not parallel and you cannot scale your I/O requests. Basically, when from user space (Postgres, with fsync in the checkpointer) you initiate a write request, it goes straight through to the disk driver, whatever you have beneath, and this is the problem: you issue an I/O request, it does not scale, it basically just goes through to the hardware. How do we improve that? The old approach to elevators was: we have a CPU, a queue, and the disk, and it is just straightforward; if we are busy, we are busy. At some point there were special queues per process, for example, which tended to be slightly more effective, but still not a game changer. The blk-mq approach basically parallelizes this queuing. Now in Linux we have, for example per CPU or per NUMA node, a dedicated software queue, and through these software queues we can push a lot of I/O. In a software queue you can do optimizations like tagging for a specific process or a specific virtual machine, plus sorting, merging, things like that, which tend to improve the performance of input and output. Then all of this ends up in queues on the hardware, and if we have an SSD, we usually have more than one hardware queue, which is much more efficient. So basically we got parallelism at the level of the request layer.
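(An aside not from the talk: on a blk-mq kernel you can see how many queues a device actually got, since sysfs exposes one directory per hardware queue, each listing the CPUs whose software queues map to it. The device name `nvme0n1` is an assumption.)

```shell
# One subdirectory per hardware queue; an NVMe SSD typically shows several.
dev=nvme0n1
if [ -d "/sys/block/$dev/mq" ]; then
    ls "/sys/block/$dev/mq"
else
    echo "$dev absent or not driven through blk-mq on this machine"
fi
```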
That is why it is a substitution of the request layer, not just a scheduler which works off to the side. Once this part of the stack was rebuilt, people started to think that we actually need new schedulers on top, schedulers which are aware of modern SSDs. Currently there are two major schedulers for blk-mq-aware, NVMe-aware, newer kernels. One of them, which is more complicated (some people tend to compare it to CFQ), is BFQ, Budget Fair Queueing. The idea is that it has some math behind it which figures out an I/O budget for a given application on a given device, and based on this budget it increases or decreases the priority of that I/O. I would not say it is particularly efficient, from my observations; it is complicated, but it works. Kyber is a different approach, just a very simple scheduler which tries not to mess with lots of mathematics, with merging, sorting and things like that. Whichever you prefer: I actually have no recommendation right now, because these are relatively new things. I would say, use the default of your current modern Linux kernel; most likely it will work well. From my experience there are no drastic changes compared to using deadline or noop on SSDs.
So that choice is basically not that important. After that we have the NVMe driver, and sometimes an improved parallel SCSI-family driver alongside it. There is a good resource, the Linux storage stack diagram, whose maintainers keep a reasonably current picture of the Linux I/O stack; you can take a look there and figure out many more details than I can put on a slide. It is not fully up to date about blk-mq, but it reflects the state of one or two years ago, so it is fresh enough to figure out what is going on, and I hope the maintainers will update it soon. And what is NVMe, actually? NVMe is not just a driver; it is a set of standards for I/O, which is quite good, and we can all appreciate that the major producers finally did that. For Linux it is a driver plus a set of standards for how other parts of the stack interact with the I/O scheduler; for the industry it is the new standard, which is still under construction. Currently NVMe works stably for local disks: if you put something into PCIe, it works perfectly well, fast and so on. But there is a lot of work going on to make this real for Fibre Channel, for disk arrays, and that is really the future, because that part needs lots of parallelism too. The industry is working in that direction and the results are quite good: stable now is version 3 of this standard, and the pre-production version 5 already allows more than 32 gigabits per second for one channel of the fibre which connects you to the disk array. Very impressive disk performance, I would say. But the next question is: with that latency change, are the databases actually aware of this development? Is it okay to just put the database on this extra-fast storage? Would it use it well?
I would say that is actually a difficult question. In Postgres, for example, we still have no real parallelism in bulk I/O operations: we have a checkpointer, which is not really parallel, and we have a background writer, which is not parallel, and that is it. The improvements in current storage are all based on parallelism inside the storage, and the database cannot yet exploit that, so probably a lot of things should be improved before we can fully benefit. Still, if you move from noop to a modern kernel with blk-mq, I have seen performance improvements of up to four times. So it can be quite good for you; I am just not sure, because I do not know your workload. Your situation could improve, maybe not as drastically as the excited NVMe developers report.

What are the latest things in NVMe and blk-mq during the development cycle of kernel version 4? The first one was I/O polling, and that is an interesting example. For a long time the idea was: you form an I/O request, you send it to the disk, and the driver takes care of the result; you get an interrupt and handle it, then the I/O operation has ended and you return to the kernel, to user space, and so on. For a low-latency disk this is actually not very optimal, because you need to wait for the interrupt, and that is a long time: your operation will probably finish sooner, and you get a huge overhead. So immediately the idea of polling came up: you send the write request through the stack to the disk, and you constantly poll your device to figure out whether it has finished or not. But because SSDs are quite fast, we run into another problem: with this polling you can spend lots of CPU resources, your CPU is busy, and your database is not happy. So after the first naive approach, so-called hybrid polling was introduced. The current idea is that you send the write request, and at
some point you poll the device to figure out whether it has finished; then you sleep for a while, and then you poll again, and most likely this is much more efficient than waiting for the interrupt. That was a huge performance improvement for blk-mq and NVMe together. Then there are the new schedulers, BFQ and Kyber, in 4.10 or 4.12; I think 4.12 is more correct. I/O tagging was also introduced: you can move a request from one queue to another, and based on those tags you can manage the priorities of I/O. That was aimed specifically at virtual machines, but I think for databases it can also be quite useful, because different processes have different I/O profiles. Besides that, there are some direct I/O improvements in NVMe connected to the internal optimizations of SSDs.

A final small note on direct I/O, because we are talking about databases and this is the PostgreSQL room, and it is a question people constantly ask: what is the current situation with direct I/O in Postgres, and why not just grep through all the source and open every file with O_DIRECT? It is simple; why don't you do this? Currently PostgreSQL does not support direct I/O for anything useful in production, I would put it that way. We basically can use direct I/O
for writing the write-ahead log, but here is the problem, and the explanation of why Postgres does not use it for anything more. If you open a file with O_DIRECT, you need to work with that file exclusively with O_DIRECT: if you open it with O_DIRECT and then, I don't know, an archive_command, or something like pg_dump, handles this file without O_DIRECT, you will have problems, most likely a crash or something like it. If Postgres works as a standalone database, it is okay to write the write-ahead log with O_DIRECT; but if you enable replication, you will probably have an archive_command, which in Postgres can be just a bash script or something similar, completely unaware of how you open the file, and you will run into problems. Because of that, Postgres automatically switches to buffered I/O for the write-ahead log in this case, even if you try to use direct I/O. The main problem, from my point of view, is that direct I/O is a very OS-specific thing, and you need to invest a lot of effort to introduce and maintain platform-dependent code. So even if the PostgreSQL community agreed to do this, it would not be an easy task, because you would need to write lots of code specifically for Linux, and generally the PostgreSQL community is not very comfortable with bringing very OS-specific things into the project. Still, direct I/O is maybe the best solution we have now, in spite of many people saying it is not that good in terms of implementation and so on. I think Postgres will finally move in this direction, because it is now basically the only classical database which does not use it. It would actually be a good exercise to figure out whether the Linux version of MS SQL Server uses it; the only exception is MS SQL Server on Windows, because Windows does not have direct I/O in the same form. MySQL uses it, Oracle uses it, DB2 uses it. So that is all about direct I/O
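(A small demonstration that was not part of the talk: `dd` can open its output file with O_DIRECT via `oflag=direct`, which makes the alignment and file-system constraints described above easy to see. The file name is arbitrary, and some file systems, tmpfs for example, reject O_DIRECT outright.)

```shell
# Write 1 MiB through O_DIRECT; the block size must respect the device's
# alignment requirements (typically a multiple of 512 or 4096 bytes).
if dd if=/dev/zero of=direct_demo.dat bs=8192 count=128 oflag=direct 2>/dev/null; then
    echo "wrote with O_DIRECT, bypassing the page cache"
else
    echo "O_DIRECT is not supported on this file system"
fi
rm -f direct_demo.dat
```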
and all about the recent changes in the Linux I/O stack. If you have questions, it is time to ask them. Thank you.

Question: In your talk you described only bare-metal installations of PostgreSQL, which also covers a private cloud, where you can negotiate certain things with the provider. A public cloud has those things inside it as well, but in a totally different way.

Well, I would say that most likely they all have the same things inside, because even public cloud providers have serious problems improving performance, and I do not believe they skip these achievements of the Linux kernel. But with a public cloud you can basically never be sure what exactly they do, and that is part of the game.

Question: But then it stops being a PostgreSQL problem; it becomes a question of choosing the flavor.

Well, I still believe that many people use a private cloud or bare-metal installations, and for them it is their problem. We are maybe not talking about public clouds here, but for private clouds it is still a problem: you need to tune your Linux, you need to figure out what is going on. Basically, many of those improvements were actually introduced for private clouds.

Question: A little bit on that. The thing is, you are provided with a virtio-blk device, and in case you are lucky you might negotiate virtio-blk with multi-queue support. But it is still different from what you described, because a virtio-blk device is not an SSD or HDD or NVMe;
it is a thing of its own.

Yeah, well, I would say that how to improve the performance of Postgres on virtio, on any virtual I/O, is quite a different problem, because here we were talking about unlocking parallelism in the Linux kernel infrastructure to handle SSDs. That is one problem; with virtualization, for example, the problem of unstable latency immediately emerges, which is quite a different story, and it is outside the scope of this talk. It is also a problem, and quite a serious one, but it is a different story.

Question: Thanks, and thanks for the talk. Have you seen much demand or traction for open-channel SSDs, and if so, how far out do you think that might be?

Well, I have heard about them because I was interested in the topic, but regrettably I have not seen them much in production. In my primary job I work for customers, and when customers run databases, they tend to buy good SSDs and use vendor-specific tooling for them. That is just real life. I would quite appreciate feedback if someone uses them, and I will take a look at it. Any other questions, please?

Question: On your last slide you have an atomic flag. Can you give some details about it?

Well, some SSDs have another option, O_ATOMIC, which is quite useful for writes. But the problem is that you cannot use O_ATOMIC without O_DIRECT: basically, you need to use O_DIRECT to benefit from such a low-level optimization.

Question: But is it supported for any block size, 16K, 64K, or only 4K?

I don't know, to be honest, because Postgres does not support it.

Question: And what is the smallest acceptable block size for direct I/O? You speak about blocks, but you have a file system, right? So even if you write one byte, it will still be 4K today.
Well, Postgres operates on 8K pages if you do not recompile it, so basically it can bring benefits at that block size, and I believe it will also bring a lot of benefit at larger block sizes, because the whole mechanism was introduced partly for databases, and many analytical databases tend to have a larger block size.

Question: Okay, and are there any plans so that we do not need to call fsync when using direct I/O? Because right now it is a different story.

It does not depend on that; Postgres basically relies heavily on fsync. I don't know if you visited Tomas's talk yesterday about the fsync problems, but there you can see that basically the whole mechanism is built around fsync. That is how Postgres is designed: we understand that, and anyway, you need to issue fsync.

Question: But why, if you already open with O_DIRECT? Twenty years ago Solaris used O_DIRECT without fsync, so why on Linux do you still need fsync?

That is a good question. Probably we have several of the right people in the room to answer that low-level question.

Yeah, the reason is that there is a development cost to implementing direct I/O on all operating systems, so we do not have the manpower to do it yet. We may next year, or the year after that, but that is the only reason.

Thanks a lot. Any other questions, please? So thank you. Yeah, thank you.