Welcome to today's Postgres Conference webinar, Reducing Costs and Improving Performance with Data Modeling and Postgres. I'm here with Charlie Batista, PostgreSQL Tech Lead at Percona, who's going to discuss how Postgres organizes data internally, how the free space map works, and how we can reorganize the data model to take advantage of data alignment inside blocks. My name is Lindsay Hooper. I'm one of the Postgres Conference organizers and I'll be your moderator for this webinar. A little bit about your speaker: Charlie is passionate about new cultures, their languages and traditions, and until recently lived in China. He's been working with databases and development for more than 12 years and has participated in small and large projects across Brazil, the US, China, and other countries. So with that, I'm going to hand it off to Charlie. Take it away. Okay, thanks Lindsay. Well, since you've done the introduction, I don't need this slide; this is just the agenda. Welcome, everyone, it's a pleasure to be here. We're going to start with a brief overview of the underlying architecture, and then move on to Postgres itself: how it stores data, and how the free space map works, to understand how all of this relates to the costs we have on the database. All right, so what is this talk about? We are not here to talk about modeling the data itself, so we're not talking about how and why to create primary keys or foreign keys; I believe we all know those are important. The idea of this talk is to understand how Postgres internally organizes the data, and how we can use that understanding to get the best out of the database. We're going to see how that organization happens on the underlying storage system, down through the file system. The aim is that by the end of this talk we have an overview of how the blocks work inside a file and what type of file organization Postgres uses, so that we can make better choices about the data types we use for our tables, and even about the order of the fields inside a table. We'll see that these choices have a big impact. All right, let's start then. The first thing I want to talk about today is the memory architecture. This is just an overview of a modern memory hierarchy, and one thing I want you to pay attention to is that we have basically four types of memory in any server. They range from the fastest and most expensive, the registers, down to the slowest and cheapest, usually the largest, the magnetic disks. The registers live inside the CPU; they are very small spaces that the CPU actually uses to work on data. They're really fast and very expensive, and that's why a CPU has only a small number of registers. The next level is the cache memory. It's the first one outside the CPU core, but it sits very close to the CPU and it's very fast. If you're buying a computer, a laptop for example, you look at the CPU and check its clock speed,
how many gigahertz it runs at, and then you look at the cache memory, because the more cache memory the CPU has, the faster it can work. In this memory architecture, the CPU cannot work on data while it sits on disk; it works on data in registers, and what's in the registers can be cached so it can be accessed again very quickly. The cache memory is not built from exactly the same type of components as the main memory; it's similar, but it uses much more expensive components than the main memory does. Then we have the primary memory, the main memory, the RAM. When we talk about computers and ask how much memory a server has, we're talking about main memory. It has become much cheaper than it was 10 or 15 years ago, but it's still much more expensive than the default bulk storage, which used to be magnetic disks and nowadays is very commonly SSDs. We'll see how these all play together. And keep in mind that the CPU is not able to access the disk directly. It only has direct access to the registers, the cache memory, and the main memory. Every time the database needs to do something, it has to ask the I/O subsystem to get the data from the SSD or the hard disk and copy it into primary memory, and only then can the CPU work on it. This is important to understand. Memory can be divided into volatile and non-volatile. Volatile memory, as the name says, doesn't hold data between restarts or across a power loss; the RAM and the caches are all volatile. When we restart the computer, everything in memory is simply lost. Volatile memory is usually fast and expensive. The opposite is secondary memory, the hard drives: non-volatile, much larger and cheaper than primary memory, but usually very slow. And as I said, the CPU has no direct access to secondary memory; it cannot directly read the hard drive. For memory access, we have basically three methods: random access, sequential access, and direct access. The first two are the ones we're interested in here; direct access is usually used with tapes and similar devices and doesn't play a part in what we're going to discuss. Sequential access is usually much faster, for several reasons. On the old spinning-disk hard drives, the main reason sequential access was so much faster than random access is that a spinning disk has moving parts, and the slowest of them is the head. When the head needs to move from its position, it's quite slow, because a mechanical motor performs that movement. That's what makes random access so expensive and so slow on spinning disks: when the data is randomly distributed across the disk, the head needs to move back and forth many, many times. With sequential access,
the head usually just needs to go to the initial position and then let the platter keep spinning while it reads the data along the way, so it's much faster. On SSDs, of course, we don't have those problems, there are no moving parts, but there is still something else at play, something we call locality. Locality plays a big role when we're reading data, and we can see its effect clearly even on SSDs. Spatial locality is the concept of data sitting next to other data we're likely to need. For example, say I have a file stored starting at block number two of my disk. Spatial locality says that when we're accessing block number two, the probability that we'll need the next block is pretty high, so the disk can prefetch that data into memory in advance, which speeds up the whole system. This is one of the reasons sequential access is so fast on an SSD compared with random access: the SSD can prefetch the data easily. NVMe devices improve things further because they can serve many requests concurrently; one piece of hardware and software organizes how all the data is read and written and improves the concurrency. So even with random access, where you can't simply prefetch the next block because many processes are reading different data at the same time, performance can get closer to sequential access. That's why on SSDs the difference is not as large as on hard drives. The other kind of locality is temporal locality. Temporal locality means that data I accessed just now has a pretty high chance of being accessed again soon. For example, say I'm selling products on the internet and a user searches for a book. The chance that the very same book will be searched again in the near future is much higher than for some other random book. If I can keep that data in memory longer, I speed up performance. That's temporal locality: keep the same information cached in memory for more time, and the database gets faster. Both concepts are heavily used in storage design, especially for caches. We take advantage of spatial locality by placing related data next to each other, so prefetching and read-ahead work well, and we take advantage of temporal locality by having larger caches and buffers that keep the data we're working on in main memory for longer. Designed this way, a system can improve database performance using those two concepts of locality. And here is a comparison on a hard drive. In this example we read 1,000 blocks of eight kilobytes each. Doing that with random access on a hard drive took 6.1 seconds; reading the same 1,000 blocks sequentially took 53.1 milliseconds. That's orders of magnitude of difference in access time. Compare that with SSDs, where the difference is much smaller, roughly double the time:
random access for the same 1,000 blocks took 4.17 milliseconds, and sequential access took just 2.35 milliseconds. The difference is not as huge as with spinning disks, but it's still almost a factor of two. So even on SSDs and NVMe drives, sequential access is still much faster than random access for the same amount of data. One thing I forgot to mention: when we're accessing data on a drive, we cannot read a single byte. Every access reads at least one block. The block size varies from system to system, and some devices have different block sizes, so you need to check with the manufacturer, but the most common block size on the market is four kilobytes. So keep in mind that when you ask your database for something, say SELECT name FROM users, and the name is only a few bytes, the database still needs to go to the underlying storage and read the whole block, and the whole block is copied from disk into memory. Understanding this is very important, because we want to keep our data aligned with the block size. If the system, in our case a database, uses a block size that matches the underlying block size, or a multiple of it, everything works much faster. For example, say the storage has a four-kilobyte block size; if we use eight or sixteen kilobytes, we're still aligned, because our block is the size of two or four blocks of the underlying system. If we don't work this way, we waste space. Say the file system has a four-kilobyte block size and we design our database to use five-kilobyte blocks: when we send data to the file system, it can't fit it in one block, so it uses two, one storing four kilobytes and another storing only one kilobyte. Every write from the database to the file system wastes three kilobytes in that example. That's why it's so important to keep the database and the file system aligned on block size. Now that we have a basic understanding of how the database works with the underlying system, let's dig into Postgres itself. Postgres uses a type of file called a heap file, one of the simplest forms of file organization. It's organized in blocks: inside the file we have many blocks, and every block is eight kilobytes by default on Postgres. We'll see later that we can change that, but it's not that simple. The heap file doesn't keep the records in any order; each page just stores an unordered set of records, and new rows are always inserted wherever there is room at the end. Insertion on Postgres is very efficient precisely because it doesn't need to do any reorganization of the file.
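As a quick aside, both of these compile-time defaults are visible from any psql session. A minimal sketch, not part of the talk's demo:

```sql
-- check the page size and segment size this server was compiled with
SHOW block_size;    -- 8192 bytes by default
SHOW segment_size;  -- 1GB by default; larger tables are split into 1 GB files
```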
By comparison, another very common file organization for databases is what we call clustered files, or clustered indexes, where the file is organized as a B-tree, normally. The file itself is kept as a tree, and every time we insert something we need to keep the order inside the file; that's why inserts are more costly with a tree-structured organization. The heap file is just a bunch of data, a bunch of pages one after the other. Another concept, especially important when working with databases, is that we don't have real deletion. When we do a DELETE, the row is not really removed from the data file; on heap files, and especially in Postgres, it's only marked for deletion. Eventually that space gets reclaimed, cleaned up, and can be used again, but it's not instantly removed from the data file. In the same way, an UPDATE is basically a delete followed by an insert: it marks the old row as removed and inserts the updated data as a new row. So whenever we delete or update, we leave obsolete data behind, and this is why vacuum is so important: it cleans up that dead information so the database can reuse the space. That's why proper vacuum scheduling matters, and why we should never disable autovacuum on Postgres. Sometimes autovacuum can't keep up with the volume of changes, and instead of properly tuning it, I've seen a lot of companies and a lot of DBAs just disable it. That is super dangerous and can bring a lot of problems, so never, ever disable autovacuum on a production database. Okay, heap files inside Postgres. As I said, Postgres uses heap files to store the data of all tables; every table is a heap file in essence. Each file has a limit of one gigabyte, but that doesn't mean a table is limited to one gigabyte, only the file: if the table grows past one gigabyte, you just get more heap files, each with a one-gigabyte limit. The pages inside the heap files are eight kilobytes. That's the default value, and we can't change it at runtime; it's not a configuration parameter. We actually need to recompile the database if we want a different page size, and it's usually not advisable. There are circumstances where it can improve performance, but in most cases it isn't worth it; if you change it for whatever reason, make sure you really understand the reasoning and the consequences, and test a lot, because it can bring a lot of problems. Within a table, all pages are logically equivalent, meaning a row can be stored in any page, and as we insert and update, the data moves around between pages. I'm going to show you a quick demo later; I actually had to cut a lot, I was planning to show much more, so I kept just the essentials.
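To make the mark-for-deletion behavior concrete, here is a minimal sketch; the table and numbers are illustrative, not from the talk's demo. An UPDATE rewrites every row and leaves the old versions behind as dead tuples that only vacuum reclaims:

```sql
-- throwaway table to watch dead tuples accumulate
CREATE TABLE scratch (id int8, v text);
INSERT INTO scratch SELECT g, 'x' FROM generate_series(1, 10000) g;

UPDATE scratch SET v = 'y';  -- rewrites every row: the old versions are only marked dead

-- the statistics views show the dead tuples waiting for vacuum
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'scratch';

VACUUM scratch;  -- reclaims the dead space for reuse (the file itself does not shrink)
```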
I saved the commands in a file; it's on my GitHub, and I'll share the link, since we won't have time to run through everything here. I separated a couple of examples for today's presentation, and the files will be on my GitHub page so you can download them and see exactly how everything is done. Okay, moving on: the page layout. This is where things start getting interesting inside Postgres. As we know, heap files are divided into pages, and each of those pages has a very specific layout. The pages are where the data is actually stored, and the page size in Postgres is eight kilobytes by default. At the beginning of the page we have a header that holds all the metadata for that specific page: where the free space inside the page begins and ends, and so on. Then we have some pointers, the entries labeled item data here; they are just pointers to the data itself, not the data. Toward the end of the page is where the actual data lives. At the very end there is a special area that holds information used mainly by indexes, mostly in index files; for our discussion we don't care what's stored in the special area. Right before it is where the data is stored, and in Postgres the data grows from the end of the page towards the middle, while the item pointers grow from the header towards the middle as well. You might ask: the pointers and the data live in the same page, so why do we need those pointers at all? It might seem weird, but it's a pretty interesting design, and one of the reasons, not the only one, is what we call heap-only tuple updates, or HOT updates; you sometimes hear people asking whether their database is doing more or fewer HOT updates. Here's what happens. Say we create a table users. On that table I have a couple of indexes: a primary key, and maybe an index on the username. Every time I insert data, the indexes need to be updated, and when I update data, the indexes also need to be updated, because remember, Postgres doesn't do in-place updates. Say I update the last tuple in a page: Postgres marks that tuple as removed, copies the data, and saves the updated version somewhere else. Because we have indexes on this table, the indexes would also need to be updated with the new location of that data. But with heap-only tuples, if the change fits inside the same page, say there's free space in that very page for the new version, the indexes don't need to be touched, because an index points at the item pointer inside the page, not at the tuple directly. We can just update the item pointer to point at the new data, and we save a lot of I/O, because we don't need to update the indexes at all. That's one of the reasons for this design of the page in Postgres.
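If you want to see how often your own updates qualify as HOT, the cumulative statistics views track it. A hedged sketch:

```sql
-- fraction of updates that were heap-only (no index entries had to change)
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```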
So if we have enough space inside the page when we do an update, we can do a heap-only update: the change happens only in the heap page, in the heap file, and it saves a lot of I/O, which makes everything perform much better. That's one of the reasons. Tuples have a layout too. We again have a header, with a lot of interesting information in it, but for this discussion we're interested in basically two fields. One is what's called t_hoff: the offset where the data really starts, because the data doesn't always start at the same position inside the tuple. This happens because of the way Postgres handles NULLs. When we have a table, Postgres keeps a null bitmap for each row: for every column that can be NULL, there is one bit that represents whether that column is NULL or not in this row. That saves a lot of space when we're working with a large table that can have a lot of NULL data. But because of that, the header cannot be a fixed size; we can't just assume a 24-byte header, because a table with many columns may need more than one byte to represent all its columns in the bitmap. Since the header isn't fixed size, we need a field telling us where the data really starts, and that's what t_hoff gives us. Remember I said I prepared some examples. I have a database here named test, and for this example only, remember, only for this example, I've disabled autovacuum. Again, never do that in production; I just don't want autovacuum interfering while I delete data during the presentation, so the outcome stays reproducible. I'm also using three extensions that help inspect the data. One is pg_freespacemap; when we start talking about the free space map, it's very useful for inspecting it. pageinspect is the one I'll use for the demos; it shows us information about the pages themselves inside Postgres. And for the visibility map there's another one, named pg_visibility. The file I've saved on GitHub has comments on all the examples we'll run. For now, I've created a table named tq_item. It's a very simple table with only a few columns: item_type, an integer of two bytes; the id of the queue itself, because we can have more than one queue; a flag for whether this item in the queue is active or not; the id of the item, which is an integer of eight bytes; and the value that it holds, a decimal. As you can see, I don't have any index, so we don't need to think about indexes here; it's just a very simple table with no index at all, populated with one million rows. If I do a SELECT count(*) from this table, we'll see we have one million rows, so it's been populated already. And this is the size of the table we'll be using: 73 megabytes. All fine, all good.
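The exact demo script lives on Charlie's GitHub; what follows is a hedged reconstruction with the column names and types approximated from the talk, setting up a similar table and peeking at page zero:

```sql
-- extensions used throughout the demo
CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- approximate shape of the demo table: note the int2 sitting right before an int8
CREATE TABLE tq_item (
    item_type int2,
    queue_id  int8,
    is_active boolean,
    item_id   int8,
    value     numeric
);

INSERT INTO tq_item
SELECT (g % 100)::int2, (g % 10)::int8, true, g::int8, g * 1.5
FROM generate_series(1, 1000000) g;

SELECT count(*) FROM tq_item;                        -- 1,000,000
SELECT pg_size_pretty(pg_relation_size('tq_item'));  -- about 73 MB in the talk

-- page 0: the header first, then line pointers and raw tuple data (pageinspect)
SELECT * FROM page_header(get_raw_page('tq_item', 0));
SELECT lp, lp_off, lp_len, t_hoff, t_data
FROM heap_page_items(get_raw_page('tq_item', 0))
LIMIT 2;
```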
What I want to show here, related to what we just saw, is that we can get this information from the table itself. I'm fetching the items inside page number zero; the table has a lot of pages since it holds a million rows, but I'm only interested in the first ones, with the first data. The page header gives us a lot of information, and if I run the items query, you'll see much more detail, including the data itself, a binary representation in hexadecimal of what's inside the page. We have the header, we have information about where the data starts, and here are the line pointers, the item pointers we saw in the layout: they point to the place where each tuple lives, and inside each tuple we can also see where the actual data starts. We also get the size of each tuple: for line pointer one, for example, the data starts at this offset inside the page and holds 72 bytes. That's the size of my row with id 1, and indeed, if I do a SELECT * FROM tq_item LIMIT 1, that's my data: I can see the item_type and the rest. This is how that row is represented inside the page. Okay, moving on. Before we go back to the database, I want to explain data alignment, and this is probably the most important thing to take away from this talk. Postgres tries to keep the size of each piece of data in a row aligned with the CPU word. To work efficiently, the CPU always tries to operate on data of the same size, which we call the word size. If we can store information in the database at the same word size the CPU works with, we make things really fast and efficient. Most modern CPU architectures have a word size of eight bytes, and Postgres applies a natural alignment to each column in a table that tries to keep the columns aligned on eight bytes. What does that mean? Let's go back to my example. In the table I created, the first column is an integer of two bytes, so this field holds two bytes, which is smaller than the eight-byte word; by itself that's not a problem. The problem is that the next field is eight bytes. Two plus eight is larger than the eight-byte word, so Postgres cannot pack those two columns together into one word for the CPU; it would have to send one, then the other, in another CPU cycle. To make this work properly, Postgres applies what we call padding: it adds six more bytes after the two-byte column to align the eight-byte column on the word size. So we waste six bytes in every row of this table, just because it's not properly aligned. If instead the next column were, say, a four-byte integer, things work out much better: the four-byte value only needs four-byte alignment, so the pair packs into a single eight-byte word.
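Even without opening the file, pg_column_size on row values makes the padding visible. A minimal sketch; the 40 versus 34 figures assume the usual 8-byte MAXALIGN build:

```sql
-- same two fields, different order: int2-before-int8 pays six bytes of padding
SELECT pg_column_size(ROW(1::int2, 1::int8)) AS small_then_big,  -- 24-byte header + 2 + 6 + 8 = 40
       pg_column_size(ROW(1::int8, 1::int2)) AS big_then_small;  -- 24-byte header + 8 + 2     = 34

-- the per-column length and alignment rules live in the catalog
SELECT attname, attlen, attalign
FROM pg_attribute
WHERE attrelid = 'tq_item'::regclass AND attnum > 0;
```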
And how can we actually see this on disk? Remember the SELECT from before: if I run it, I get the position of each tuple inside the page. With that, I can use OS tools, for example xxd, to go inside the file and check what's really there. Let's do that now. First I need to find the data folder and the OID of my database, and then the filenode of my table. Let me just make sure I'm pointing to the right place; okay, I am. Here in the data folder of my database we have the file for the table we just created. If we run xxd on it, we get the hexadecimal representation of the data inside. Remember that the first 24 bytes, in our case, are the tuple header; the last byte of the header is the null bitmap, and the data starts right after it. This first value should be the id for our row. I am terrible at converting hexadecimal in my head, so I'll just use a calculator: converting the hex value gives us exactly the value we expect, so this is clearly our data. For the next field, because of the endianness in use, the bytes are stored least-significant first, so we read them in reverse order to get the value. Now pay attention: look at all these zeros here. They were never our data; they're pure waste. Postgres is using this padding to make the next field line up with the CPU word size, so the data stays aligned when it goes to the CPU. The next field is fine, because it's an eight-byte integer, exactly the size of a word. But the field after it, is_active, is a boolean, and a boolean only holds one byte of information, in our case true. So we're wasting another seven bytes in this row. A few wasted bytes in a row of only a handful of columns doesn't sound like much, right? But what if I reorganize the table? Let me do exactly that. I've created the table again, this time named tq_item_good, with exactly the same columns as the other table, the same structure, just organized a little differently: I put all the eight-byte fields, the integer-eight columns queue_id and item_id, at the beginning, then the smaller fixed-size fields together, and the decimal last. The thing about types like decimal, numeric, and varchar is that the database doesn't know in advance how much data they will hold, because they're variable: the size can change with every value the user stores, so there's no predefined size the way an integer, a bigint, or a two-byte integer has.
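A hedged reconstruction of the reordered table; again, the names and types are approximations of the demo:

```sql
-- same columns, reordered: fixed eight-byte fields first, small fixed-size
-- fields next, the variable-length numeric last
CREATE TABLE tq_item_good (
    queue_id  int8,
    item_id   int8,
    item_type int2,
    is_active boolean,
    value     numeric
);

-- copy exactly the same data over, then compare sizes
INSERT INTO tq_item_good
SELECT queue_id, item_id, item_type, is_active, value FROM tq_item;

CHECKPOINT;
VACUUM tq_item;
VACUUM tq_item_good;

SELECT pg_size_pretty(pg_relation_size('tq_item'))      AS original,
       pg_size_pretty(pg_relation_size('tq_item_good')) AS reordered;
```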
That's why the variable-length column goes last. So I just reorganized the columns. Let's create this table and populate it with a SELECT from the original, so I have exactly the same data in both tables; nothing differs from one table to the other, they hold exactly the same rows. I'll do a CHECKPOINT to make sure everything is flushed. I don't strictly need a vacuum, but let's run one just in case: I want to be sure that any updates I may have done don't interfere with the sizes, so we're really comparing the original tables. And if I check the size of both tables, what an amazing surprise: the table with all that padding holds 73 megabytes, while the table where I merely reorganized the columns, exactly the same columns, exactly the same data, just in a different order, is 67 megabytes. That's roughly eight percent of space saved inside the database, just from column order. And remember, from the beginning of the talk: everything the database works on has to go to memory, and the memory representation of the data is exactly the same as on disk. You ask the I/O subsystem to copy the data from disk into memory and then work on it there, so all that wasted space is wasted in memory too. That's how we get real improvement and real cost savings. This is just one table, and a pretty small one; imagine when your tables grow to terabytes of information how much you can save by the end of the month when you prepare the AWS bill, or how much storage you save on premises. It flows through to memory and CPU as well, so there's a lot of storage and cost to be saved once we understand how the data is laid out inside the database. One thing I skipped: TOAST. Now that we understand how data is saved, another thing to pay attention to is that Postgres never splits a row between two pages. What do I mean by that? Say one row has a big text value of more than eight kilobytes. Postgres is not able to save it inside a normal data page. What it does instead is create a separate file, the TOAST file. Its representation is very similar, it also works in eight-kilobyte pages, but the oversized data is stored externally in the TOAST file, and inside the tuple, instead of the data, Postgres keeps a pointer to the position of that data in the TOAST file. This kicks in whenever a value is really large, beyond a default threshold of two kilobytes; that's what Postgres considers too large. A value larger than two kilobytes, by default, is stored elsewhere, not in the data file but in a TOAST file.
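To watch TOAST in action, here's a small hypothetical experiment; the docs table is mine, not from the talk, and SET STORAGE EXTERNAL just disables compression so the effect is easy to see:

```sql
CREATE TABLE docs (id int8, body text);
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;  -- out of line, no compression
INSERT INTO docs VALUES (1, repeat('x', 10000));          -- well past the ~2 kB threshold

-- the heap tuple keeps only a small pointer; the value lives in the TOAST relation
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class WHERE relname = 'docs';

SELECT pg_relation_size('docs') AS heap_bytes,
       pg_relation_size((SELECT reltoastrelid FROM pg_class
                         WHERE relname = 'docs')) AS toast_bytes;
```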
Because the heap tuple then holds only a small fixed-size reference, the padding that the large value would otherwise force on the row layout can be avoided, so TOAST is another technique that helps save space. And, well, we just saw how alignment and padding can play a really big role. This slide is just the calculation for the example we did: for the same one million rows, one table versus the other, there's a big difference purely because we changed the order of the columns. It can save a lot of space. We're getting to the end now; the free space map is our last stop. Remember when I said that when we delete a row, we aren't really deleting it in place? The database marks the row as removed, and later, when autovacuum or a manual vacuum passes by, it frees up that space inside the data file. Postgres needs a mechanism to keep track of that space, and that mechanism is the free space map. The free space map, as the name says, is a map; it works very much like a tree, and it tracks the empty space inside each page. As we can see in this diagram, at the bottom we have the page numbers, and the row above shows how much free space each page has. But a number like 2 here doesn't mean two bytes or two kilobytes free; it means two units of empty space in this page. What does that mean? To keep the free space map small and performant, the database uses only one byte per node, one byte to represent the free space of an entire page. If the page is eight kilobytes, it's not possible to represent eight kilobytes exactly in one byte, because a byte only holds 256 distinct values. So Postgres takes the page size, eight kilobytes by default, and divides it by the number of values a byte can hold: 8192 / 256 = 32, so in this example each unit represents 32 bytes. If page number one here has two units free, it has 64 bytes free. In the tree above, each node holds the largest value found anywhere beneath it in its subtree: for example, above pages zero and one, which hold zero and two units, the parent holds two, and further up we find nodes holding the larger maxima from the other subtrees. This helps the database navigate quickly when it needs to find available space in a page, as we're about to see. Let's say we want to find a page that has at least four available units in the vicinity of page number four. Pay close attention to this "vicinity". Postgres, like most databases, always tries to preserve spatial locality when it writes. If we're updating something inside page number four, it first tries to keep the new data inside that same page, because then we can use a HOT update. If that's not possible, it tries to keep it close by, because if we ever need a full table scan, the more sequential the data is, the faster it's going to be.
That way we avoid random reads, which is why Postgres always tries to keep related data close together. So, the question: find at least four available units in the vicinity of page four. The database first finds page four. Page four only has two units free; not enough. We go one level up the tree and check this subtree: the maximum here is three, still not enough. We go up to the next subtree, where the maximum is seven. Since we came from the subtree on the left and the answer isn't there, we descend to the right; this node holds seven as well, and going down to the left we find a page, page 16 in this example, that can hold those four units. That's how the free space map works. And because it uses only one byte per node, the free space map is really, really small compared to the data file. Let's check that here. I need the file path again; of course it changed, because I did a VACUUM FULL earlier, and VACUUM FULL rewrites the table into a new file, so the filenode is different. Okay, here's the new one. If we look in the directory now, there is no free space map file for the table yet. But let's delete some rows from tq_item and then run a vacuum, a plain vacuum, not a VACUUM FULL; and here we go, we have the free space map. If we look at the sizes, the data file itself is 74 megabytes and the free space map is only 40 kilobytes, just a tiny fraction, because of the way the representation works: one byte per page. It saves a lot of space and it's very efficient. With that, we come to the end; let's go to the summary of what we saw today. Postgres stores data in heap files, a very simple form of file organization, and the heap file is divided into blocks of eight kilobytes each. The data has no order at all; even if inserts happen to land in insertion order, an update changes that, because Postgres doesn't update in place, it marks the old tuple as dead, and once vacuum cleans it up, something else can be inserted into that space. So we cannot guarantee any order inside the pages. Postgres tracks the free space map with a tree structure that uses one byte per node, so it's very efficient. Deleting a record, as we saw, doesn't really remove it, it only marks it, and updates do the same. Postgres uses a natural alignment of eight bytes internally, and this is probably the most important takeaway from this talk: the alignment, and the padding it requires, can really bloat a table. We can't remove the padding entirely, but if we're careful when designing and modeling the database, we can keep it to a minimum and save a lot of space and a lot of money.
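Before the Q&A, a hedged sketch of that free-space-map inspection, using the pg_freespacemap extension loaded earlier; the DELETE predicate is illustrative:

```sql
-- free up some space, then let vacuum record it in the FSM
DELETE FROM tq_item WHERE item_id % 17 = 0;
VACUUM tq_item;

-- per-page free space as the FSM sees it
SELECT * FROM pg_freespace('tq_item') WHERE avail > 0 LIMIT 5;

-- the fsm fork is a tiny fraction of the main fork
SELECT pg_size_pretty(pg_relation_size('tq_item', 'main')) AS heap,
       pg_size_pretty(pg_relation_size('tq_item', 'fsm'))  AS fsm;
```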
And you can find the slides and the demo scripts on my GitHub, as I said, and here are my contacts. That's it. Thanks, everyone, and we're open for questions if you have any. Charlie, thank you, that was great. We have three questions here. Are there any real-world systems that use a block size other than 8 kB? Well, there are. MySQL, for example, if I'm not mistaken, uses a block size of 16 kilobytes, and there are systems that use a block size of 64 kilobytes. It's not really a problem if a designer chooses a block size that large; the thing we need to take care of is alignment. The most common OS block size is four kilobytes, and as long as there is alignment between the block size of the underlying subsystem and the database we're working with, we're good to go. And if we understand our workload, say I have a system that stores a lot of images and other large binary files, making the block size larger for that specific system may gain a lot of performance, especially on SSDs, which can read a big chunk of data at once, and even on hard drives, because we get sequential reads of those big blocks. But again, we need to understand how our system works and how the underlying subsystem works; then we can make the right decision about going with a block size larger than eight kilobytes, as some systems do. Wonderful. A question about spatial locality: how strong is spatial locality usually between different tuples on the same page, given that tuples can be stored on any page in a table? Do you know any benchmarks around this? So, tuples on the same page: remember, the database doesn't read individual bytes from the disk, it reads whole blocks, the whole page. So if we're accessing more than one row from the same page, we get the best performance we could hope for: the database fetches that page once, puts it in memory, and that's it, one request to the disk gets everything we need, and the CPU then gets it from memory in one go. So it's really fast. When the data is spread randomly across the system, it depends a lot on your setup. Say you're on SSDs: in the best case you spend roughly twice the time to get the data, and you probably also drag in a lot of junk data you don't need. For example, say you have the table users and you want to select the users with id 1 to 10. If ids 1 to 10 are all in the same page, you need only one page. But if you've done a lot of updates and every id ended up on a different page, you now have ten different pages, so instead of reading eight kilobytes from disk you read eighty kilobytes; from the disk's perspective that's at least ten times slower, all because the information isn't in the same spot. That's how influential spatial locality can be when designing the database.
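One easy way to see this scattering for yourself, assuming a users table like the one in the answer: every row's hidden ctid column is its (page, slot) address in the heap.

```sql
-- ctid exposes each row's (page number, slot) inside the heap
SELECT id, ctid FROM users WHERE id BETWEEN 1 AND 10;
-- tightly packed: (0,1) (0,2) ... (0,10)   -> one 8 kB page read covers all ten
-- scattered:      (0,1) (137,4) (5021,2) ... -> one page read per row
```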
Right, because when the data is split, we may need to go to the disk many, many times. And remember that everything goes through memory: if those ten rows are on the same page, you put one page in memory; if they're spread over ten pages, you now hold ten pages in memory, and you might only need a single row from each. Ninety percent of the data you pulled into memory is useless, at least for now, for this query. So the impact can be huge. As for benchmarks, it's hard to talk about benchmarks here because, as I said, every system behaves differently; but if your data is really sparse, scattered across the database rather than kept together, the degradation can be pretty big. Beautiful. And the last question: why does Postgres not automatically reorder the columns for optimal alignment? Because it's expensive, a very expensive operation. With a small table like this one, with one million rows, it might take just a couple of seconds, maybe milliseconds with a good I/O subsystem. But imagine that every time you update the table it needs to be reorganized. We're trying to get thousands or millions of operations per second; if I have a thousand updates per second and each one triggers a reorganization, that's a thousand reorganizations per second. It's not doable; from the OS perspective and the I/O perspective it's far too expensive. That's also one of the reasons we have buffers and caches. On most operating systems, Linux being the one I work with and know, you have the I/O cache, the I/O buffer. What it does is keep recently written data in memory rather than flushing it to disk immediately; it flushes every second or so, depending on the scheduler, but it holds all the data you update. That's one way we try to keep the data organized: when you write to the database, instead of flushing straight to disk, the data is held in memory. Of course, the transaction log files still record every change you make right away, because we need that for persistence, but the data files themselves are not updated immediately. And because of temporal locality, data you just updated has a good chance of being updated again: if we get a thousand updates to the same table within the same second, we can hold those thousand updates in memory and flush them to disk once. We don't reorganize the data on every write. And that's the reason databases, not only Postgres but all of them, don't reorganize the data every time you do an update or an insert: they hold the data in memory and flush from time to time, keeping data locality as good as they can. Lovely, thank you. Those are all of our questions. So Charlie, thank you so much. I know we had some technical difficulties, but it was well worth staying on the line.
You're getting a ton of thank-yous and good-jobs in the chat, so on behalf of all of our attendees, thank you and great job. Well, I'll say thank you all for the patience, and thank you for the opportunity; it was a pleasure to be here. Brilliant. And to all of our attendees, thanks for spending a little bit of your day with us. I hope to see you at future Postgres Conference webinars, and have a great rest of your day.