I have two friends who were talking a few days ago about engines. They have a lot of money, and they were talking about building something quite powerful, with a lot of horsepower. And I tried to understand the discussion and the reasons behind what they wanted to buy.

My name is Pep Pla. I'm a consultant at Percona. I have more than 30 years of experience, which makes me feel really very old, extremely old. And currently I'm located in Barcelona. This is why I can take pictures like this one: this was working at the beach.

Let's have a quiz. Name that database: it's a popular open source database, it was originated by a guy with the first name Michael, Michael has a history of saying the most controversial things, and he has founded several companies. I guess you already know which database we are talking about. And the answer is... Postgres? MySQL? MariaDB? Vertica? It's just a funny thing that both databases have at their origin the same first name, Michael, but different Michaels.

So, how to compare databases? Back to my two friends. One of them was considering buying this one: a Ferrari Roma, with 620 horsepower. The price? Let's just say I don't earn that amount of money in a few months. And it's the ideal car if you are Jason Statham. The other one is a John Deere 9R 640. It's called 640 because it has 640 horsepower, and the list price is around three times the price of the Ferrari. It's ideal if you are a farmer, a huge farmer. You can do a lot of things with such a tractor.

So how to compare things? Well, the first thing you need to do is check the features, and this is very easy: you just need to build a list of your requirements, compare the candidates to see which one fits best, and don't assume they have the same features. When I said engines and horsepower, probably you were thinking sports cars, but there are also tractors. And if you are a farmer and you go to your farm with a Ferrari, you will not succeed. So it's very important to consider the features. But this is not what this presentation is about: I could just take the list of features, compare them, and say this one or that one, but I think it makes no sense.

It's also not about licensing. Why? Because there's a lot of heat regarding the licensing topic, and there are a lot of different opinions. I'm just going to say a few words about MySQL. If we compare the advances MySQL has made in the last years, and we check the companies that were ruling MySQL at the time, probably we would get some surprises. I'm not an Oracle employee at all, I'm not paid by Oracle, but, for example, today's MySQL is quite advanced compared with MySQL 5.1.

While preparing this presentation, I learned about a concept called the Birmingham screwdriver. I don't know if you know what a Birmingham screwdriver is; I was not aware of it. This is a Birmingham screwdriver: a hammer. There's something called the law of the instrument (in Spanish we have sayings about this): if you have a hammer, you tend to treat everything as a nail. If you know MySQL, you tend to consider that MySQL is better. If you know Postgres, you tend to consider that Postgres is better. I recommend you Google "which is better, MySQL or Postgres". You will find a ton of people saying: I remember I worked with X ten years ago, and I'm a Y DBA, and Y is quite better.
Well, you are comparing Y today with X ten years ago, and you worked with X for maybe one month while you were testing something. It's a bit hard to compare using this criteria. So most of the comparisons out there have some bias. Why? Because it's very hard to find somebody with a really good understanding of both the Postgres and MySQL technologies. Feature-wise they are comparable, but the internals are not exactly the same.

But this is not always bad. What I mean is: if God gives you lemons, you make lemonade. If God gives you Postgres DBAs, then use Postgres. Don't try to do something different because you attended a conference and everybody was talking about whatever database, or because somebody said something in a meeting. This is a true story: I remember, in a meeting with a client, the client said, why don't we use NoSQL? I don't know, why should we use NoSQL? A "why don't we" question can never really be answered with a no, because: why don't we sell the company? It makes no sense.

So, disclaimer: I'm a MySQL DBA, but I will try to be as agnostic as possible. Maybe agnostic DBAs don't exist; if I'm not one, I apologize, because I'm not trying to sell you either technology, especially because I don't believe your criteria should be based on bias. I'm saying that I'm a MySQL DBA just to excuse myself if I say something wrong about Postgres.

One relational database is pretty much the same as another, right? Well, no, they are not the same. They do the same things, but they do them in different ways. And it's very important to understand those differences, because by understanding them, you can decide which technology is better for you. Okay? And this is extremely complex, because your requirements are very diverse. There's no simple rule like "if you write a lot, use Postgres; if you read a lot, use whatever". It's not that easy, because it depends on the exact write or read patterns. And sometimes even application developers don't know their access patterns, so it's very complex to decide.

So, if we compare Postgres and MySQL: both are relational, both are open source, both are popular, both are old enough to be allowed to drink. And this is sometimes not cool. Sometimes it looks like being version 16 or version 8 is not cool and being version 0.21 is cooler, but in our case it's a proof that both pieces of software are consolidated and pretty much stable. The differences? If you Google, you will find these are the most important ones: Postgres is supposed to have better SQL standard support, and it's ruled by mailing lists. MySQL is supposed to be easier, and it's ruled by Oracle, although the community has a lot of things to say. And the devil is in the details.

In Spanish we say, "Where is Vicente going? Where the people are going." It's the equivalent of "monkey see, monkey do". So what are the people doing? What we are seeing is that open source database trends are growing enough to compare with commercial databases. Actually, two years ago the open source databases became more relevant than the commercial ones. And if we look per database, we see that MySQL is trending to be more relevant than Oracle, and we see huge growth for Postgres. So probably in the next five years or so, we'll see that both databases will be the leaders in the database market.

So let's move on: the architecture. There are plenty of differences between Postgres and MySQL.
If I have to choose one, this is the one I think is the most important, especially because it's a characteristic that only MySQL has. MySQL has a two-layer architecture: one layer is the SQL processing layer, and the other one is the engine layer. What this allows is having different engines. You can have InnoDB, but you can also have MyRocks, or Blackhole. You can have different engines that define how the data is stored. If you really want a fast database and don't care very much about integrity, I recommend you use the Blackhole engine. It will be really very fast. Your application will be extremely fast, with no data, of course, because it's like /dev/null. But it's a proof that MySQL has additional flexibility with engines.

Somebody can say: okay, what's the sense of having an engine that works like /dev/null? We've been using this engine in the past for replication purposes. You replicate from one database to another database that uses the Blackhole engine, and then from that database to another one. Then you can have, for example, 5.1, 5.6 and 5.7, and you are replicating from 5.1 to 5.7, which, by the way, in theory is unsupported.

So the fact is that when we say MySQL, usually we are talking about InnoDB, which is now the default engine. In this presentation I will be talking about MySQL and InnoDB, and both terms are valid. But if you are using another engine, then it's different. So, is MySQL just MySQL? No. When I said that engines make a difference, they make a difference. And if you need a different engine, that's great, but 99% of database users don't need additional engines. So it does not make a difference for the vast majority of users: almost everybody runs InnoDB. But knowing about the layer architecture is good, because it helps you understand some of the internal concepts of MySQL and how they impact performance. To make it short, there's something called a handler. A handler is how the MySQL layer talks to the engine layer. There are different handlers for different operations, so when you see that one handler counter is prominent, it probably means that you are doing a lot of full scans.

Postgres: the heap. In PostgreSQL, the tuples are stored in a heap, and we identify a tuple by two values together, called the CTID: the page, and the location within the page. Data is unordered. Actually, it's not that data is unordered; the thing is that data is stored in the same order it is written. So if you write ordered data, it will be ordered. If you write in a random order, then the data will be stored in a random order. One really important difference is that updated or replaced rows are kept in the heap. This means that if you update a row, the old version will be kept in the same table. And in the heap there is a lot of metadata, but relevant for this presentation are three values: the xmin, the xmax, and the CTID. The xmin identifies the transaction that inserted the row. The xmax identifies the transaction that removed the row, or not: when I say "or not", I mean that if I try to remove the row and roll back, I'll still have a value here, okay? And the CTID identifies the next tuple: if I replace the row, this CTID points to the next version of the tuple. Those transaction IDs are 32-bit integers that can wrap around. If you have experience with Postgres, you've probably heard about this; if not, we'll discuss it later.
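To make this concrete, here is a minimal PostgreSQL sketch (the accounts table is invented for illustration): every table exposes these system columns, so you can look at them yourself.

```sql
-- PostgreSQL: every table exposes the tuple metadata described above.
-- "accounts" is a hypothetical table, used only for illustration.
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100), (2, 200);

-- ctid = (page, offset within the page);
-- xmin/xmax = inserting/deleting transaction IDs.
SELECT ctid, xmin, xmax, id, balance FROM accounts;
```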
The pros of the heap: inserts are really very fast, because you have to do nothing, you just insert at the end. This is very important. New data is placed together, so this is very good for temporal locality: in theory, the data we are inserting will be related to data being inserted in the same period of time, okay? So this means that somehow related rows will be kept together. It's very simple to manage; it has no complexities. And rows are found by location: if you know the page and the location within the page, it's very easy. You just need to access that page and then go to the position within the page, and voila, you have the row.

The problems: fragmentation. What happens if I start removing data? I will have holes, okay? I can have pages that are not completely full, pages that have just 10% allocated. Rows are found by location: it was a pro, but it's also a con. Why? Because sometimes certain operations force us to modify indexes. If I modify a row, I will create a new version of that row that will be located using a different CTID, and if I have an index that points to that row, I will need to modify the index, okay? So if I reorganize the table for whatever reason, I will need to reorganize the indexes. Anything I do that can move a row will involve reorganizing or changing all the pointers to that row. These two facts mean that you need to rebuild periodically, because you can have fragmentation. Also, the fact that the old versions are placed in the table pushes you to keep free space inside each page for updates. And what is important is that you need that amount of space for the whole row. If you have a huge table and you change just one field, you will have a full copy of that row in the same page or another page: you will have two versions of the row, instead of just the modified field.

The InnoDB clustered index. In InnoDB, everything is an index. All the tables are an index. This means that, for example, you always need a primary key. If you don't define a primary key, InnoDB will create one for you, and the problem is that it will be hidden: you will not be able to access that primary key. But there will be a primary key. When you do an update, that row is modified in the table, even before the commit. What InnoDB does is assume that you are going to commit. So if you modify a row, it changes it, and if you commit, that's it, there is nothing more to do. If you roll back, then it has to undo the changes. And the metadata here is just two fields: the transaction ID, which identifies the last transaction that modified the row, and a pointer to a rollback segment. A rollback segment is where we store the old versions of the row. In Postgres, the old versions are stored inside the table; in MySQL, the old versions are stored in rollback segments.

Pros: data is stored by primary key. This means that it's sorted, so if you often have to do range scans based on the primary key, the rows are together. You have no fragmentation. This is something people are not aware of: the fact that the table is an index means that when a page is empty, it disappears from the index; you can release it. With other engines, an empty page stays in the middle: since nothing points to an empty page, you just mark it as empty. And actually, if a page is not fully empty, the engine is clever enough to look at the surrounding pages to see if it can merge them and free a page completely. So if we have two pages at 50%, the engine will merge them and release that space. This means that in the worst-case scenario you will have 50% allocation, but this is a really very edge case, because you would need to remove rows from each page in just the right way to keep them all around 50%. In that case you would need to rebuild, but I've seen this, I would say, never.
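As a small sketch of what "the table is the primary-key index" means in practice (the events table and its columns are invented for illustration):

```sql
-- MySQL/InnoDB: the table itself is organized as the primary-key B-tree.
CREATE TABLE events (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  created_at DATETIME NOT NULL,
  payload VARCHAR(255)
) ENGINE=InnoDB;

-- A primary-key range scan reads rows that are physically stored
-- together and already sorted:
SELECT * FROM events WHERE id BETWEEN 1000 AND 2000;
```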
Rows are found by primary key, which is good if you have primary keys, because you save going to the primary key index and then going to the table. You just go to the primary key index, and since the table itself is the primary key index, you find the row. And it's really very good at sequential inserts. Why? Because it does a trick: it stores in the page the direction of the inserts. So while you insert data into a page, when it arrives at the end and the page is full, it says: okay, I've been inserting values that are growing, so I can keep this page as it is and open a new, empty page, and the next values will go to the new page. If the inserts go in another direction, it will create a page on the other side. And if they are random, then it will consider whether it makes sense to split the page. So if we do sequential inserts, it will create pages almost completely full, one after the other.

The cons: it's very bad if the inserts are not sequential, because you tend to create empty spaces. You need a primary key; if for whatever reason you have a table that has indexes but no primary key, you need to generate a primary key somehow. And "rows are found by primary key" is sometimes also a bad thing. What does this mean? In Postgres, the index holds the CTID to find the row. In InnoDB, the secondary indexes hold the primary key. This means that if I have a large primary key, my secondary indexes will be very big, because it's not something like an integer for the page and maybe 8 bytes for the row inside the page. No: if I have a primary key that's large, then all my secondary indexes will contain that primary key. And also, when I access through a secondary index, I access two indexes: first the secondary index, and then, with the primary key, I go to the table, which itself is an index, so I need another index access.

Secondary indexes. I'm using MySQL wording, because in MySQL, as the primary table is an index, the secondary indexes are all the other indexes. The sense I'm using here is all the indexes that are not the primary key. Postgres has something great: you have a lot of different types of indexes, while in MySQL you just have one type. The problem is that if you rebuild the table, you need to rebuild the indexes, and if you migrate or move a row for whatever reason, you need to update that row's entries in the indexes, even if you are not modifying any of the indexed columns. The good thing is that the indexes are smaller, because the CTID is not a big value. In the case of InnoDB, you have only B-trees. The indexes are larger, because you need the primary key. The good thing is that some reorganization operations do not affect the indexes. For example, a page split, okay? A page split does not need to notify all the affected indexes that something has happened in the primary table. And there is just one version of the row. Well, this is not completely true: if a row is deleted, the old version is kept, but it is marked as deleted. And there's something called the change buffer that delays writes to secondary indexes, and it can delay those writes for a really long time. Actually, it usually flushes the writes when the affected page is accessed, so if the page is not accessed, it can be a long time before the change is made effective in the secondary index.
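To picture the cost of a wide primary key, here is a hypothetical sketch (the orders table and its names are invented):

```sql
-- MySQL/InnoDB: every secondary index entry carries the primary key.
CREATE TABLE orders (
  order_uuid CHAR(36) PRIMARY KEY,   -- 36 bytes, repeated in every secondary index
  customer_id INT NOT NULL,
  created_at DATETIME NOT NULL,
  KEY idx_customer (customer_id)     -- each entry is (customer_id, order_uuid)
) ENGINE=InnoDB;

-- A lookup by customer_id does two B-tree dives: idx_customer gives the
-- order_uuid, and the clustered index gives the row.
SELECT * FROM orders WHERE customer_id = 42;
```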
MVCC: well, this is multi-version concurrency control. I guess more or less everybody knows what this means. In our databases we have a ton of users, a ton of transactions running at the same time, modifying data, writing. We have to see the old versions, because I'm running a transaction while somebody is incrementing values and things like that. So to deal with all this concurrent access, read and write access, you have different methods, and one of the methods is multi-version concurrency control. Both databases use this method. The trick here is that you keep different versions of the row to avoid locking access.

How is this implemented in Postgres? Rows are inserted and deleted; you never update. If you update a row, what you do is mark the old version as deleted and insert a new one. This means that you generate a new CTID, and so on. Those new rows are stored in the same table and often, if possible, in the same page. This means that if you have a lot of changes in certain tables, you will have a lot of old versions stored inside that same table. The xmin and xmax are as I said before, and the CTID is different for each version. There is something to mitigate this, called HOT updates: HOT updates, in theory, sometimes allow you not to change the secondary indexes when there's a change in the heap table, but you can only use them under some circumstances. So you have the two values we said before, which identify the transactions that modified that row, and you also have something called the commit log. This is a list of all the transactions that have ever been executed in the database, with the state of each transaction. This list is cleaned when you do something called vacuum. But until you do, you have a list of all the transactions saying: transaction one, committed; transaction one million, aborted.

How do I know if a row has a value I need to see? If the xmin value in the row is bigger than my transaction ID (I am transaction ID 10, and I go and see that the row has a value of 11), this means that the row was created after I was born, after I started as a transaction, because it has a number that is higher than mine. So I should not see that row. If the transaction ID of the row is smaller than mine, this means that the row was created before I was born, so in theory I should be able to see it. But then the problem is whether that row has been deleted or not, which means checking whether xmax is defined. If xmax is empty, the row was created before I was born and is still active: I can see it. If xmax is defined, I need to make sure whether the transaction that deleted the row has committed. I have to go to the commit log and look at that transaction. If it's still active, the deletion hasn't taken effect, so I should see this version. If it has committed, I shouldn't see it. If it has rolled back, I should see it. So I need to look at different values just to figure out which version I have to see, and in some cases I have to check the transaction: I need to go to the history and see whether that transaction was committed.
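A small PostgreSQL sketch of what this looks like in practice, reusing the hypothetical accounts table from before:

```sql
-- PostgreSQL: an UPDATE is really a "mark deleted" plus an insert.
SELECT ctid, xmin, xmax, balance FROM accounts WHERE id = 1;

UPDATE accounts SET balance = balance + 10 WHERE id = 1;

-- The visible row now has a new ctid and a new xmin; the old version
-- stays in the heap (with xmax set) until vacuum removes it.
SELECT ctid, xmin, xmax, balance FROM accounts WHERE id = 1;
```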
Who removes the old versions? The vacuum cleaner.

In the case of InnoDB, MVCC is based on rollback segments. Old versions are stored in the rollback segment, assuming that the commit will happen. So we change the table and store the old version in the rollback segment. In this case, we have row metadata that allows us to find the correct version; I will not get into the details. But I want to say something: long-running transactions can require a lot of rollback segment accesses. In the case of Postgres, we have a problem: all the versions are in the table, and sometimes they are together. And this is a problem because it consumes a lot of space. But it is also great, because if I need to find different versions of the row, probably they are together, in a page that I already have in memory, so I don't need additional accesses. In the case of rollback segments, if I have a long-running transaction and lots of data has been modified, I can go look for a row and say: oh, it's been modified, I have to go to the rollback segment. Oh, somebody modified it again: I have to go again to the rollback segment, and again. And as the rollback segment records changes over time for the whole database, the changes are not together, so maybe I'll have to go to disk to read the rollback segment. So long-running transactions can do a lot of harm in MySQL. Those are the details. And one of the good things is that you don't need to run vacuum: there is a background process that purges all the rollback segment entries, and it is transparent.

Vacuum is something very specific to Postgres, and it is critical. You can disable vacuum; don't do it. What vacuum does is clean the old versions, and that's perfect, that's great. But it also does something else. In Postgres, the transaction IDs are small: they are 32 bits. This means that if you have a lot of transactions, it can happen, actually it will happen, that you will have repeated transaction IDs. And as we said, data is stored with the transaction ID. So if you repeat transaction IDs, how do you know if a row is in the past or in the future? Imagine you find a row that has a transaction ID of your transaction ID plus 10. So the trick here is that by doing vacuum, what you do is freeze some rows. There is a special transaction ID that marks a row as frozen; I think it's transaction ID 2. So what vacuum does is look at the table, remove old rows, and for the rows that have to stay, check the transaction ID. If it's an old transaction ID, of a transaction that is not active and was committed, then it says: I'll freeze that row, using transaction ID 2. And once all the rows of the original transaction have been frozen, the transaction can be removed from the commit log. So the vacuum will purge the commit log and also mark some rows as frozen. If you don't do this, sooner or later the database server will say: oh, I have to stop, I can't continue, because there is a risk of transaction ID wraparound.

Now, a full vacuum cannot be done online: you need exclusive access to the table. But to freeze rows, you don't need a full vacuum. A full vacuum is only needed if you want to release the space allocated by deleted rows. Yes, yes: if you want to do a full vacuum, you need to have at least the same free space that the table occupies. This is a general rule: if you want to recreate an index in MySQL, you will also need that amount of space. This means that if you have a several-terabytes table, you need several terabytes of free space.
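In PostgreSQL terms, a minimal sketch of the two flavors of vacuum, plus a standard catalog query to watch wraparound pressure (the table name is the invented one from before):

```sql
-- Routine vacuum: removes dead tuples and freezes old rows; no exclusive lock.
VACUUM (VERBOSE) accounts;

-- VACUUM FULL: rewrites the table to give space back to the OS; takes an
-- exclusive lock and needs roughly the table's size in free disk space.
VACUUM FULL accounts;

-- Age of the oldest unfrozen transaction ID per table: the closer this gets
-- to 2^31, the closer you are to a forced shutdown for wraparound protection.
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 5;
```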
Postgres has some specific things that MySQL does not have. One of those things is materialized views, which are very helpful in terms of performance. The list is quite long, but I'm running out of time, so you can look it up. Something interesting is the visibility map. The visibility map is a map of which pages have changes and which don't, and it's used to avoid vacuum looking at pages that have no changes. When a page has changes, there's an entry for that page in the visibility map, and this helps make the vacuum process faster. There's also something called TOAST, The Oversized-Attribute Storage Technique. This is quite similar to what InnoDB does: if you have data that does not fit inside a page, you store that data out of line, somewhere else. Postgres has roles, and they are more popular than roles in MySQL. MySQL also has roles, but the implementation is a bit dirty, because it actually treats them like users. The syntax is pretty much the same, but it's not well implemented. And one more great thing: you have things like the FILTER clause in aggregates, for example. This is great, and MySQL does not have it.

Regarding availability, this is something that MySQL has really very well resolved. In the past we had, and we still have, Galera, and now we have Group Replication and InnoDB Cluster. With Group Replication, you can already provide a good high availability solution. Group Replication is quite similar to Galera. And we are running out of time.

Also, something that is a difference, but everybody has an opinion about it: MySQL uses threads and Postgres uses processes. Which is better? Who do you love more, Dad or Mom? Regarding memory utilization, MySQL tends to manage the memory itself, and Postgres tends to rely more on the operating system. Which is better? I have an opinion, but I will not share it. It depends on whether it's a dedicated server or not. If you're running a dedicated server, assigning all the memory to the database can make a difference and can protect you from, say, a process that decides to read a 20-gigabyte file, consumes all the file system buffers, and expels all your database pages from memory. But at the same time, if you have a server that runs different applications, relying on the operating system gives you more flexibility. For the MySQL guys: if you overallocate memory, you may find that all of a sudden your MySQL server dies, because the OOM killer sees that somebody is consuming a lot of memory. It's different. Different does not mean better. Thank you very much.