I work at AWS as part of the Enterprise Support group, and in my day-to-day work my primary responsibility is to work with customers to make sure they're following database best practices and know how to operate their databases at scale in the best possible way. One thing I enjoy about my job is that I get to work with customers who are constantly pushing the boundaries of what is possible with the database. I come across customers who are running things at a scale where they run into challenges that only show up at that kind of scale. Today I'll be talking about how we innovate at AWS on behalf of customers with that kind of profile, with PostgreSQL. Now, before I get started with my talk: why PostgreSQL? I hear a different flavor of this question all the time. I hear it at conferences like this one, I get asked at user meetups, and even in customer meetings: why PostgreSQL? The answer is built around the way the PostgreSQL community operates. It has strong guardrails that ensure the project and the community are not owned by any one company or entity. The community focuses on very high quality code and stability, and it focuses on building and releasing features that are useful to the large PostgreSQL user community. A testament to these things — that it is not owned or controlled by a single entity, and that it is focused on users — is this number: 37. Any guesses what this number is? That's how long PostgreSQL has been around; that's the age of PostgreSQL. It turns out it has been around just a little bit longer than I have. So, a bit of trivia there. Another piece of trivia about me and my association with this conference: this picture is from February/March 2016. That was the first FOSSASIA where we hosted a PGDay track, and I was part of the organizer team at that point. Apart from me, the people who helped me run that conference were Bruce, Nikhil and Umair. You can tell this is from a different lifetime, because you can see Bruce Momjian. If you've seen Bruce Momjian at any recent conference, you'll always see him going around with a bow tie, and this is from an era when he was not wearing a bow tie at conferences. So you can see it's from a different lifetime, not just for me but for some of the other folks from the community as well. All right. Now, let's take a look at PostgreSQL at AWS and Amazon. We have true open source PostgreSQL as a managed service in Amazon RDS for PostgreSQL. Then there is a cloud-native PostgreSQL option, where you can deploy your PostgreSQL workload so that it can scale and gives you high availability with all the cloud-native features, with Amazon Aurora PostgreSQL. And we contribute to the PostgreSQL community. One of the ways AWS contributes and supports the community is by sponsoring events like PGDay Asia at FOSSASIA and PGConf India — basically sponsoring conferences. There are a lot of other ways in which we contribute back to the community: developing patches, reviewing code and patches written by somebody else. We are recognized as a major sponsor. We also employ people who work on contributing features to the PostgreSQL community. Jonathan Katz, who is my colleague, is a core team member of the PostgreSQL project.
Apart from that, we've got six committers, seven major contributors and three recognized contributors who work on features that are contributed to the PostgreSQL community. With all of that, you'll see that the contribution from AWS to the PostgreSQL community has been increasing over the past few years. If you look at PostgreSQL 16, about 19% of the features were contributed by AWS. And the contribution comes in a variety of ways: authoring patches, reviewing a patch written by somebody else, participating in design discussions on the mailing lists, helping features get committed. All of that is part of the contribution that comes from AWS folks. Here are some of the key contributions from AWS in PostgreSQL version 16 — and as I said, they come in different forms — and we continue to contribute for future releases. We see some very peculiar behavior with some of our RDS and Aurora PostgreSQL customers, behavior that only shows up at the kind of scale at which these customers operate. We take that learning and contribute performance enhancements and bug fixes back into the PostgreSQL community, and we continue to do that for version 17 as well. If any of you has done a major version upgrade of a database that participates in logical replication, you understand how painful it is to first drop the slot, find the consistency point, perform the upgrade, then create the slot again, let replication resume, and avoid any kind of data loss on your target system. One of the things AWS is working on is the ability to perform pg_upgrade without having to drop replication slots, which is going to improve availability. So we have all these features in progress that will be part of future releases. If you're keen to learn more about how AWS contributes back to the community, go ahead and scan this QR code. It will redirect you to a YouTube video where you can watch Jonathan Katz — the person I mentioned earlier — talk about how, and in what ways, we contribute to the PostgreSQL community. Okay, now a little bit about Amazon Aurora. It's a cloud-native database engine. We designed it to meet the needs of enterprises with demanding requirements — customers who are tired of working with legacy databases because of punitive licensing, significant expense and the lack of cloud-native capabilities. Amazon Aurora is our answer to those needs. At the psql prompt, or for that matter with any client driver you choose to use, Amazon Aurora feels like PostgreSQL or MySQL, depending on which engine compatibility you choose when you launch the cluster. But under the hood we have a distributed, self-healing, fault-tolerant storage system that can scale up to 128 terabytes. It manages replication of six copies of data across three Availability Zones and lets you add low-latency read replicas within a few minutes. So it allows you to scale in a true cloud-native way while maintaining availability and offering really high performance, and we aim to do that at a cost lower than the commercial and legacy databases.
If you want to know more about how Amazon Aurora works and its underlying architecture — how the replication works, how to achieve high availability and disaster recovery with Amazon Aurora, all these native features — this is another QR code (and I'm going to share a couple more of these), which takes you to a re:Invent talk where Tim covers some of these key concepts. Okay, all right. I'm not going to talk about all the features of Amazon Aurora, but something I'm personally very excited about is the release of Amazon Aurora PostgreSQL 16. The reason I'm excited about this particular release is that it demonstrates our commitment to maintaining currency with major version releases. Instead of creating a fork, we decided to keep up with new major versions as and when the PostgreSQL community releases them, because that's very important for our customers. It helps build trust with our customers and it helps ensure there is no lock-in, so it's very important for us. Now, let's talk about a few recent launches. Amazon Aurora has this distributed, fault-tolerant storage system I talked about. All the storage requests from your instance are sent to the storage system, and it can scale as your workload scales — the IOPS scale as your workload scales. So you don't need to provision IOPS. If you're running databases on EC2 or on RDS, you need to provision IOPS; with Amazon Aurora you don't, it scales on demand and you pay on demand as well. That is the standard storage configuration for Amazon Aurora. But our customers have been asking for a predictable I/O cost, and the I/O-Optimized cluster configuration is our answer to that need. It's a cluster-level configuration that applies to all instances, and it gives customers a predictable cost model as well as better price performance. It is available on both Intel and Graviton instances; you can use it with provisioned instances or with Serverless instances, and you can also use it with reserved instances. Now, along with the launch of I/O-Optimized clusters, we were also able to make certain improvements focused on workloads that do a lot of I/O. One of them was around batching of writes, and that allows write-heavy workloads, especially on larger instances, to scale up much better and give you up to 50% better performance compared to a standard Aurora cluster. If you're using a smaller instance class, say 2xlarge or 4xlarge, you're still going to see some performance improvement; it may not be of the same magnitude, but you'll still see some. We have another new feature for I/O-heavy workloads, for two different kinds of profiles: one where the workload creates a lot of temporary objects, and a second one where the working set could be larger than the available RAM. I'm going to talk about both of them. All right, this involves a little bit of animation as I walk you through, so I'm going to turn towards the screen. All right, so in PostgreSQL, what happens?
If you're doing some kind of sort operation or a CREATE INDEX, it is going to spill to disk if the amount of data you're working with is bigger than work_mem or maintenance_work_mem. As it spills to disk, it gets written out as temporary objects or temporary files and then read back into memory at some point. Now, with Amazon Aurora, that disk happens to be an EBS volume which is 2x the size of memory. That works fine most of the time. But now, with Optimized Reads instances — basically instances with NVMe storage, the r6id and r6gd instance families — you get NVMe storage attached, and that is what gets used for temporary objects. So instead of using the EBS volume, you get this NVMe storage, which is closer to your instance. It works the same way; it just improves performance, because it's so much faster to write to the locally attached instance storage. That's with a standard Aurora cluster. If you use an I/O-Optimized cluster, then we only provision 2x the memory size for temporary objects. You might be wondering what the rest of the storage is used for, and I'm going to talk about that now. The rest of it is used as a tiered cache. Let me explain what I mean by tiered cache. In PostgreSQL, typically, if you try to read something and it is found in the shared buffers, it's returned from there. If you have read queries that try to access a buffer that's not in memory, it gets read from storage. If you keep accessing that page frequently enough, it remains in the cache. But at some point your connections may drop off, or the workload that was using those pages may go away, and then the page is evicted from the cache. This is standard PostgreSQL behavior. What we do with the NVMe storage is introduce 4x your available RAM as a tiered cache. We reserve about 4.5% of the shared buffers for maintaining metadata. If a page is not found in the shared buffers, we first check the tiered cache metadata to see whether it is available on the NVMe storage; if not, that is when it is read from storage. Now the page is in the shared buffers and your connections are using it often enough; at some point the connections go away, or that workload goes away, and the page is moved into the tiered cache. This happens asynchronously, so we are not blocking any workload, we are not blocking anyone. The next time the page is needed, we check the metadata, see that it's available in the tiered cache, and bring it in from there. And if there is an update — the challenge with updates is that one option is to go and update the cached copy, and the second option is to evict the page from the cache whenever an update happens. We choose to do neither, and instead we just update the metadata. We invalidate the entry in the metadata: we say, hey, this block that you could previously fetch from the tiered cache — don't do that anymore; next time you need it, get it from storage. So we just update the metadata, saying that the page that was in the tiered cache is now stale, and it will eventually be evicted.
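Backing up to the first half of this — sorts and index builds spilling to temporary files when they exceed work_mem — here is a minimal sketch of how you can observe that spill yourself. The table and column names are made up; the point is the Sort Method line in the plan output.

```sql
-- Hypothetical table/column; the thing to watch is the "Sort Method" line.
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table ORDER BY some_column;
--   Sort Method: external merge  Disk: ...kB   <-- spilled to temporary files

SET work_mem = '512MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table ORDER BY some_column;
--   Sort Method: quicksort  Memory: ...kB      <-- stayed in memory
```

The spilled case is exactly the temporary-object traffic that Optimized Reads instances redirect to the local NVMe storage.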
All right, now let's put that into some tests. This is a very specific test, and I want to call out the details, because performance tests are always nuanced — they can't be generally applicable. This is a test with uniformly distributed keys, and even the SELECT queries are uniformly selecting from that same distribution. So basically we are saying: all my data is my working set. My data happens to be 85 GB on a 4xlarge instance; with the amount of memory available in that instance class, the shared buffers are going to be somewhere around 80–85 GB. So we are working with data that pretty much fits into the shared buffers, and it doesn't matter whether I use the tiered cache or not, because everything is in shared buffers. It's very fast, and you can see the effect: most of the queries have very low latency, and the average latency is 0.23 milliseconds. But as the data size increases — if I go to 4x that data size — the latency also becomes 4x. If I go to 8x the size of the data, roughly 8x the size of shared buffers, then my latency is 5.5 times. And you can see the distribution shifts: you have more queries that are slower, because they have to wait for storage I/O. Now, if I use the tiered cache with roughly 4x the size of the shared buffers, you see that it is not as fast as memory, but it is not as slow as storage — it's somewhere in between. It's only about 1.5 times the latency you see when everything fits in RAM, so it's not as bad as reading from storage all the time. And then there's the tiered cache with about 8x the size of data. In this case not everything fits inside the tiered cache; some of the time it will miss the tiered cache and have to read from storage, but that's not as frequent, and that's why the latency is still somewhat better than reading everything from storage. All right. One place where this is extremely beneficial — and where the benefit is quite distinctly visible — is with vector data. If you're storing vector data and indexing it, the size of the index is typically going to be larger than the size of the data, and in those use cases the tiered cache gives you up to 9x better performance. This test is with a total database size of 128 terabytes and an index size of 720 GB, and you can see about a 9x performance difference on a 12xlarge instance. All right. Grant McAlister, who is a senior principal engineer on the Amazon Aurora team, talks about these enhancements and features in great detail; it's available on the re:Invent YouTube channel. That place is a gold mine for anyone trying to learn more about AWS services, so if you're interested, go ahead and check it out. All right. Now I'm going to talk about Amazon Aurora Limitless Database — that's another launch I'm excited about.
I'll give a high-level overview, but then I'm going to talk about a very specialized aspect of Limitless Database: maintaining PostgreSQL's transaction semantics in a distributed database. My idea in sharing that is to show how some of these concepts can change when you move to a sharded database, and the design complexities behind that kind of engine. All right. There are three ways to scale: you work harder, you work smarter, or you get help. Harder means you increase capacity — you go from a 12xlarge to a 16xlarge or a 24xlarge instance on AWS. Luckily that's the push of a button, so you go to a bigger instance and get more work done. Working smarter means optimizing your workload: using I/O-Optimized clusters for certain workloads, Optimized Reads instances, and tuning the workload itself — in the next talk, Ronil and I are going to cover how you can tune PostgreSQL performance using wait events. And then you can get help by distributing the work. If it's just read workload, you can add more readers to an Aurora cluster and that helps you scale. But when it comes to writes, it's a little more complicated, and that's where the common approach is sharding the database. Limitless distributes tables across multiple Aurora clusters served by serverless compute instances. It allows you to scale data size up to the petabyte scale. The write throughput increases as the workload grows, because the workload goes to different shards, and those shard instances can also scale up and down because they use Aurora Serverless v2 technology. This is a new level of scaling while still maintaining the semantics and experience of the relational database you get with PostgreSQL or Aurora PostgreSQL. All right, it is PostgreSQL compatible, because that is very important for our customers — I emphasized that earlier. It's fundamentally Aurora PostgreSQL. You continue to use the same open source client drivers you use today with Aurora Limitless as well. It uses PostgreSQL parsing and query semantics. There is a broad surface area of queries that are supported — PostgreSQL supports a very broad range of queries, and it's hard to support all of them in a distributed system, but Aurora Limitless supports the majority of them. It also supports specific Aurora PostgreSQL extensions: of the extensions supported on Aurora PostgreSQL, some are also available on Limitless. There are three different types of tables or entities you can create, as shown in the sketch below: sharded tables, which are distributed across the shards; reference tables, which have a copy on each shard — not eventually replicated, but written to each shard in a strongly consistent manner; and standard Aurora PostgreSQL tables, of which there is just one copy. Of course, the data of that table still has six copies across three Availability Zones, but from a user perspective you have only one copy. Right. Now, as I mentioned earlier, I'm going to talk about how we manage to maintain the same PostgreSQL transaction semantics in Limitless.
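Before the transaction-semantics discussion, here is a rough sketch of how the three table types just listed might be declared. The talk does not show the DDL, and the session-variable names below (rds_aurora.limitless_create_table_mode, rds_aurora.limitless_create_table_shard_key) are my recollection of the Limitless preview documentation — treat them as assumptions rather than a syntax reference.

```sql
-- Sketch only: variable names are assumed from the Limitless preview docs.
-- Sharded table: rows are distributed across shards by the shard key.
SET rds_aurora.limitless_create_table_mode = 'sharded';
SET rds_aurora.limitless_create_table_shard_key = '{"customer_id"}';
CREATE TABLE orders (
    customer_id bigint  NOT NULL,
    order_id    bigint  NOT NULL,
    total       numeric,
    PRIMARY KEY (customer_id, order_id)  -- shard key is part of the key
);

-- Reference table: a strongly consistent copy is kept on every shard.
SET rds_aurora.limitless_create_table_mode = 'reference';
CREATE TABLE currencies (
    code text PRIMARY KEY,
    name text
);

-- Anything created without these settings behaves as a standard table.
```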
Now, before we go further, a little refresher. Read committed is the isolation level where, whenever I start a query, I set a horizon — that horizon is basically my transaction snapshot, and whatever I read comes from below that horizon. All the data that was committed before the query started, I can see. Any transactions that committed after my query started, or that started after my query started, I won't see their effects; and any transactions that were uncommitted when the query started — I won't see their uncommitted changes either. Repeatable read is a little bit stricter. It says the horizon is set not at query start but at transaction start. Everything I just said still applies, but it takes effect from the transaction start time. So anything committed after the transaction started, the queries will not see — which also means that if you repeat the same query multiple times, you see exactly the same data set. Everyone with me? Cool. Now, the way Postgres works, as I mentioned, is that it sets a horizon using snapshots. Snapshots carry transaction IDs, and the idea is that you can only see what committed before that point. Postgres achieves this with its MVCC model: every row you create — inserts, and updates, which also create a new row version — is stamped with a transaction ID, and that's essentially how snapshots work. At a high level; it's a bit more complicated than that, but I'm keeping things simple here. It turns out it is a lot more complicated and difficult to maintain that single-system semantic in a distributed system. Well, you can do it, but then you have to sacrifice performance. If you really want to maintain performance and scale it as you go across shards, it becomes much more challenging to maintain that consistency and those semantics. As we said, Postgres tracks each transaction, and snapshots track transactions. When you're scaling out, you want to do things in parallel, but if you use a coordinator, that coordinator becomes a bottleneck, a point where things start to slow down. So we can't use a central transaction coordinator. And transactions can be short or long; a transaction could go to one shard or touch rows sitting on different shards; it could be implicit — you just start it with an UPDATE — or explicit, where you say BEGIN. We don't want the user experience to differ at all from what users get on PostgreSQL. We don't want to say: you need to be declarative, you need to define up front what this transaction is going to do. We want to keep the same experience. So how do we do that? The way we solve it is by using clocks — very specialized clocks — and I'm going to talk about that. I've got about 10 minutes and three or four slides, so I think I have time to explain this in a little detail.
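Before moving on to the clocks, here is a small two-session illustration of the refresher above; the accounts table and values are made up, but the snapshot behavior is standard PostgreSQL.

```sql
-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- say this returns 100

-- Session B, meanwhile:
BEGIN;
UPDATE accounts SET balance = 200 WHERE id = 1;
COMMIT;

-- Session A again:
SELECT balance FROM accounts WHERE id = 1;   -- still 100: the horizon was fixed
                                             -- when the transaction took its snapshot
COMMIT;

SELECT balance FROM accounts WHERE id = 1;   -- now 200: under READ COMMITTED every
                                             -- new query gets a fresh snapshot
```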
Okay, so instead of talking through what's on the slide, let's do a simple thought experiment. Everybody has a clock, right? If you look at your clock, note down the time to the granularity of seconds, write it on a piece of paper and pass it to your neighbor — even if you do that very quickly, it's possible that the time you noted down is ahead of the time your neighbor writes down. So you say "this is my present", your neighbor says "that's my future", and you won't agree on the time. If you go around the room, pretty much nobody will agree on the time; maybe two people will, but not everybody. Now, instead of that, suppose we write down a time boundary rather than a single time. Then you can say: well, the time is definitely between — what's the time now? Let's look — between 2 p.m. and 2:05 p.m. (I was just making sure I'm not suffering from time-zone confusion; that would make things even more complicated.) Let's say we agree the time is between 2:00 and 2:05. If you pass that piece of paper around, pretty much everybody will agree on it. That makes things a little simpler. It turns out we can get that exact experience on EC2, from the Amazon Time Sync Service, which gives us the current time together with the earliest possible time — the real time is not earlier than this — and the latest possible time — the real time is not later than this. And this is with microsecond-level granularity. We embed this information into tuple visibility — row visibility in PostgreSQL, the MVCC concept I talked about, with its multiple row versions. We use it for global read-after-write consistency, and I'll show that in a quick walkthrough of how it works under the hood. We use it for two-phase and one-phase commits as well. All right, now let's look at a concrete example. Again, there's a lot of animation here, so I'm going to look at the slides. You start a transaction in repeatable read, with a SELECT query on an account ID, the primary key, so it touches only one shard. The time of that is T100. The router is one of the nodes that accepts queries from clients, and the shard nodes are, for simplicity, where the data sits. What happens is the router accepts this query, assigns time 100, and asks the shard to execute that query as of time 100. The shard goes and looks for rows that were committed as of time 100 — and this "time 100" is not just a single point; it's the bounds. Whenever I refer to a time here, always think about the boundary I just described. Now, in another, concurrent transaction, I run a SELECT that locks a row in order to update it — again on a single shard. When I get the row, I get it at time 103; then I update the row and perform the commit at time 110.
The shard assigns a commit time of 110 to that transaction. There is another transaction, in a different session, that runs after the commit; it executes at time 125. So it will see the changes made by the previous transaction, because it is starting at a time later than that commit — it's definitely going to see it. Now, if I go back to the first transaction and run a new query, which happens at a later time, the transaction still has its assigned time, and it reuses the same time to read rows. Even if it goes to a shard it has never touched yet, it still asks that shard: give me the rows committed as of T100, because that's my transaction horizon. So we embed the time information in the transaction snapshot and in the row visibility as well. Now, here is a different example, with read committed. Read committed should be simpler — if we could do repeatable read, read committed should be simpler — but I'll use a different case here, one where we go across multiple shards. Let's do a sum of account balances, which touches all the shards. We run the query, it runs at T100 and reads the data as of T100. Then I start a transaction that does a balance transfer: debit from one account, credit to another. The two accounts sit on two different shards, and the UPDATE statements start at different times — one goes out at 103, the next at 107 — and then I commit. This commit is done in a two-phase manner because we are touching two different shards. The router asks the shards to prepare the commit; they assign their own timestamps, which can differ, as you see in the example, and acknowledge the prepare back to the router. The router then assigns a commit time and asks both shards to commit as of that time. The router decides, and then it acknowledges back to the client — but before acknowledging, there is a little trick. All databases ensure that before a commit is acknowledged, it is durable on storage, on disk. In addition to that, we also make sure that the clock has passed T120: the router assigned the commit time T120, so we wait until the time is past T120 before acknowledging back. It might sound like we are adding latency to the commit, but this happens in parallel with the disk I/O and the storage network calls, so you're not going to see any latency at all — and if you ever do, it will be very, very small, because we're not talking about seconds here; we're talking about sub-millisecond ranges, which is what we get from EC2. So it's not that challenging. With this concept of time boundaries, we are also able to make sure there are no torn transactions in a snapshot: when you restore a snapshot, you get the whole system — you don't get partial transactions, you don't get torn transactions.
I'm almost out of time, but if you want to learn more about Aurora Limitless and how it works under the hood, with examples and a demonstration, go and watch this re:Invent video where Chris and David talk about it at great length. If you're interested, you can also sign up for the preview — their talk has a link for that. And if you're interested in learning more about AWS, sign up with AWS Educate today. AWS Educate is open to any individual — whether you're a student or a technician, and regardless of where you are in your learning journey — you can sign up and start learning from the courses available there. All right, that's about time. Thanks everyone, thanks for listening. A question from the audience about the commit: who decides which timestamp gets applied to the commit in the different places? The way it works is that each shard prepares the commit at a time, as you can see here, and then lets the router know. The router assigns a commit time and relays that information back to the shards, and it waits until that time has passed before acknowledging back to the client. Make sense? Another question: two or three years ago there was already an attempt to get such a feature committed in core — it was called CSN, maybe. First, why didn't you participate in that effort instead of building your own? And second, that work got stuck on Microsoft patents — it's a legal problem. How do you avoid that problem, given the patent mostly covers this technique? If I got your question right, you're saying there was a similar kind of feature — a distributed commit — being built and contributed back to the community: why didn't we use that? That's a great question. One thing I can say is that this is built specifically with Aurora, and there is a lot more going on here; this is just one example. We're also handling multiple shards, and those shards can have hot partitions — the challenge of a hot partition or hot shard where most of your reads and writes are focused. There is an element of scaling up, where serverless will scale up, but at some point you may have to redistribute your shards. We handle that by using the underlying ability of Aurora storage to replicate at the storage level, and by using clones to redistribute the shards, which is not exactly something you can do with storage attached directly to your database. On the second part of your question, I'm not the best person to explain why certain design decisions were made, but I'll be happy to connect you offline with somebody on our service team who can take that question, or we can dig into what that feature was about. Sure — CSN, you say? I try to follow the hackers list sometimes, but that's a discussion I'm not aware of. I suggest we take this discussion offline; it's clearly interesting to both of us. One last short question. Hi, thanks for an interesting talk. I didn't understand which strategies are available for this sharding: can shards split when they are overloaded — from the CPU point of view, for instance — or only by size?
So, if shards are overloaded from the CPU point of view, they just scale up, because underneath, each shard instance is basically an Aurora Serverless v2 instance. It scales up in capacity and you don't have to manage that. When you define an Aurora Limitless cluster, you just define the total number of ACUs — Aurora Capacity Units. You're basically defining a budget: hey, here is $100, go figure out how you want to spend it, on this part of the workload or that part, scaling instances up as and when needed. But at some point you may end up with a scenario where one key range on a shard is more popular than the others, and then you may want to break that range out into two more shards. That's when it will automatically do that for you, to manage the heat. Okay, let's thank the speaker again. I have to say sorry — I'm the chair, but I was a bit late because I was discussing Postgres and the cloud with the Fujitsu team down there and completely lost track of time. Very sorry, everyone. We'll continue immediately: the same speaker continues here, about tuning Postgres with respect to wait events. But I guess they've had enough of me, so I'll pass on the mic. Okay, over to the speakers. Thank you everyone, and welcome to our session on tuning Postgres performance using wait events. My name is Ronil Kumar and I'm a specialist solutions architect, focused primarily on Postgres at the moment, and I'm based in Sydney, Australia. My function is basically to help customers design and run their databases well in the cloud. And you've already heard from Samir — Samir, if you want to say a word? Hey, hi, I'm Samir; once again, I've already talked about what I do at AWS, so I'll pass it back to Ronil to continue. Thank you. So, in terms of what we're going to talk about today: the primary things are how to monitor slow-running SQL statements on Postgres, some ways of looking at locks in Postgres, how to do session monitoring and sampling of wait events on Postgres, and finally we'll finish with a case study on identifying bottlenecks using wait events. So, Postgres lets you log slow-running queries by setting a threshold in the log_min_duration_statement parameter, and you can use the pg_stat_statements extension to identify the average and maximum time taken by the top-running SQL statements. The pg_stat_statements extension also provides other useful information, such as buffer hits, file writes, temporary file usage and CPU time, which we'll see here as well. Your workload can typically get bottlenecked by the range of resources on the screen — CPU, I/O, and also object-level waits. Typical examples of CPU-bound workloads are those with high concurrency where the working set fits inside the shared buffers.
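Before the discussion of I/O-bound workloads continues, here is a minimal sketch of the two mechanisms just mentioned — log_min_duration_statement and pg_stat_statements. The threshold and the LIMIT are arbitrary choices for illustration.

```sql
-- Log every statement that runs longer than one second (value is illustrative).
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- pg_stat_statements must be listed in shared_preload_libraries; then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top statements by average execution time, with buffer and temp-file usage.
SELECT query,
       calls,
       mean_exec_time,
       max_exec_time,
       shared_blks_hit,
       temp_blks_written
FROM   pg_stat_statements
ORDER  BY mean_exec_time DESC
LIMIT  10;
```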
I/O, on the other hand, is the most common bottleneck for large databases. Workloads where the working set fits but a small set of pages in the buffers is modified by concurrent backends can also suffer from buffer contention, or from lock contention on a hot row or hot page. Now, while these SQL monitoring techniques give you information about resource usage, you can still have situations like two bad queries running in completely separate time windows. For example, you could have two queries that each run for 30 minutes and contend for certain resources, but they don't collide — they never run at the same time — so you may never see them causing any problems; they run in their own windows and happily complete. Now, let's look at monitoring using pg_stat_activity. pg_stat_activity gives you the current state of each backend process along with the query currently being executed, and it also gives you the age of the query and of the related transaction. It also tells you about wait events — what the backend is currently waiting on, such as IO, lightweight locks (LWLocks) or heavyweight locks — through the state and wait event columns. Let's look at the state and the wait event in more detail. If you look at a backend process, the transactions you run are usually in either the active or idle state: active means your query is currently running, idle means it has run and the backend is waiting for the next instruction. The state that causes the most pain is "idle in transaction". In my experience this typically comes from an application bug or a third-party API: it means a transaction was opened, some activity ran, and the transaction was never closed — no commit or rollback was issued. That means the session holds on to its state and its locks until the end of the session, and it can also block maintenance actions like autovacuum. That in turn causes additional performance problems, because you can end up with higher I/O, poor buffer utilization, and ultimately an increased risk of transaction ID wraparound. So if you constantly see transactions sitting in "idle in transaction" for several minutes, that's a sign to look deeper into the database — it's normal to see "idle in transaction" for a few seconds while a transaction is running, but not for long stretches. The wait event type, wait event and state values can all be seen by querying the pg_stat_activity view, and the wait event and state columns are independent.
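A small sketch of reading those columns directly from pg_stat_activity — including spotting long "idle in transaction" sessions as described above. The one-minute threshold is an arbitrary example.

```sql
-- Current state and wait information for every client backend.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM   pg_stat_activity
WHERE  backend_type = 'client backend';

-- Sessions sitting "idle in transaction" for more than a minute.
SELECT pid, usename, now() - xact_start AS idle_for, left(query, 60) AS last_query
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
AND    now() - xact_start > interval '1 minute';
```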
So if a backend process is in the active state, it may or may not be waiting on some event. If the state is active and the wait event is not null, it means the query is currently being executed but is blocked somewhere in the process. The wait event type tells you the nature of the wait, and the specific thing the session is waiting on is indicated by the wait event column in pg_stat_activity. A session in the active state can go through different wait types over the life of a query. For example, an UPDATE might first wait on a data file read to bring data in from storage, then wait on a lightweight lock on a buffer in order to modify it, and while committing, the session may wait on a lightweight lock for WAL insertion — or, if it's waiting on another process, on a heavyweight lock. So the typical wait events that cause contention are of type IO, Lock (heavyweight locks) or LWLock. Postgres uses these different kinds of locks to protect a given resource against concurrent modification by two backend processes. A backend process that wants to access a particular resource must first acquire a lock on that object — like calling dibs on it. If another process is currently holding a lock on that object, the backend adds itself to the wait queue for that resource, and as soon as the lock is released, the waiting session acquires it. Finally, the pg_locks view gives you information about the object locks held by transactions within the Postgres database. pg_locks contains one row per active lockable object: if a table has locks on it, you'll see rows for it when you query pg_locks. A lockable object may appear many times, once for each transaction holding or waiting for a lock on it, and an object with no current locks will not appear in the view at all. This is the pg_locks view you can see on the slide. You can use the pid column to join back to pg_stat_activity and retrieve other information about the session holding or requesting a lock, as sketched below. Now, here is a summary table showing the lock modes you'll encounter, and as you can see, not all lock modes are compatible with each other. For example, the SHARE UPDATE EXCLUSIVE lock shown here does not conflict with ROW EXCLUSIVE, but it does conflict with itself. Do you want me to change the mic? This one is off.
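Here is a sketch of the pg_locks / pg_stat_activity join just mentioned — finding who is waiting on a lock and who is blocking them — before the lock-mode discussion continues.

```sql
-- Ungranted lock requests, joined back to the waiting sessions.
SELECT l.pid                   AS waiting_pid,
       l.locktype,
       l.mode                  AS requested_mode,
       pg_blocking_pids(l.pid) AS blocked_by,
       a.query                 AS waiting_query
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  NOT l.granted;

-- What the blocking sessions are currently doing.
SELECT pid, state, query
FROM   pg_stat_activity
WHERE  pid = ANY (SELECT unnest(pg_blocking_pids(l.pid))
                  FROM   pg_locks l
                  WHERE  NOT l.granted);
```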
Now, that means that while maintenance operations like vacuum are running on a table, you can still run SELECT, INSERT, UPDATE or DELETE statements against it, but you won't be able to modify the schema definition of the table or run another concurrent maintenance operation on the same table at the same time. Now that we've looked at the wait events and lock types, let's see how concurrency can affect performance. If you have a highly concurrent workload with several hundred or thousands of active sessions, it affects resource utilization, and you can typically observe that at the operating system level: I/O latency, some form of resource blocking, a drop in free memory, or an increase in CPU usage. That in turn shows up as an increase in the number of active sessions you see, corresponding with the CPU utilization on the database. This is done by sampling pg_stat_activity, so let's look at sampling in greater detail. In a highly concurrent workload with a high number of queries on your database, you won't get to the root of an issue just by looking at pg_stat_activity at a single point in time — you need to understand what led up to the event. So when you see high CPU, or symptoms of that sort, you need to look at the historical data that led up to it, and one way of doing that is sampling pg_stat_activity. Sampling gives you an overall view of your workload: instead of accumulating statistics for each component, it examines system activity at particular points in time across all components. You only get the samples that are available — you don't have full measurements — but each sample provides an overview, from user calls all the way down to system calls, of what's occurring at that particular point. Averaging those samples over a given period gives you the average activity for that period, and it aggregates the multiple data points to average out the noise, so you can see where tuning efforts should actually be focused. We discussed how to sample pg_stat_activity, and how to slice the captured data across different dimensions, in our PGConf India talk last year, which you can watch on YouTube. So without spending too much time on that topic, we want to show you how you can look at average active sessions by state, and how you can find out whether your workload is CPU-bound, I/O-bound, or held up by lightweight locks, using wait events. If you sample pg_stat_activity together with pg_stat_statements, you get a holistic view of your workload: it tells you, over time, which wait events were present at each point in time.
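Since the slides aren't reproduced here, this is a bare-bones sketch of what "sampling pg_stat_activity" can look like if you build it yourself (on RDS and Aurora, Performance Insights does this for you). The table name, one-second interval and five-minute window are all made up for the example.

```sql
-- A history table with the same structure as the columns we sample.
CREATE TABLE activity_samples AS
SELECT now() AS sample_time, pid, state, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  false;   -- copy the structure only, no rows

-- Run this periodically (say once per second, from an external scheduler or a small loop).
INSERT INTO activity_samples
SELECT now(), pid, state, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  state <> 'idle';

-- Average active sessions per wait event over the last five minutes,
-- assuming one sample per second; a NULL wait event on an active session
-- is counted as "CPU" (the heuristic mentioned later in the Q&A).
SELECT coalesce(wait_event_type || ':' || wait_event, 'CPU') AS wait,
       count(*) / (5 * 60.0) AS avg_active_sessions
FROM   activity_samples
WHERE  sample_time > now() - interval '5 minutes'
AND    state = 'active'
GROUP  BY 1
ORDER  BY 2 DESC;
```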
So, for example, you see something happening at a particular time — you see the system isn't keeping up — and you can look at what led up to it, at the wait events present at that point in time, which is why sampling is so important. It also lets us correlate average activity with how many times a particular call was made: for example, for a given wait event it also tells us how many calls led up to it. With that, I'll hand the mic over to Samir to walk through the scenario. Thanks, Ronil. I know we are behind time a bit and we've got about two minutes to wrap this up, but I do want to cover this very quickly. We talked about one use case of sampling these wait events from pg_stat_activity and using that to build the kind of view we discussed earlier, where you see the trend in the wait events. We covered that at length at PGConf India, so for this talk we built a different scenario: we see that the throughput — the number of transactions the system can process — drops drastically, almost 10x, all of a sudden. So we look at the other metrics we sample. We see that the CPU utilization for the same period has also dropped, which is a bit confusing: the CPU utilization is lower, so what is causing the degradation? It looks like less work is getting done. Then we look at the number of backend sessions: the number of connections is roughly static, but we see an increase in the number of blocked transactions. We can see all this because we are also sampling other metrics — pg_stat_statements, pg_stat_database, pg_stat_activity — and we sample not just the number of connections but also their state. So we see the number of blocked transactions has increased, and then we look at the sampled wait events. By the way, if you're wondering: this is not something I plotted myself in a spreadsheet — it's available to our customers on RDS and Aurora as Performance Insights, but it's also something you can build yourself if you sample pg_stat_activity and its wait events. This is where we see two wait events showing up at almost the same time as the performance degradation: the transaction ID lock and the tuple lock. The transaction ID lock shows up when you have a lot of concurrency in your workload on the same row — two different connections trying to update, insert or acquire a lock on the same row at the same time. One session acquires the lock and does the update; when the second session tries to acquire the same lock, it is blocked — refer back to the lock compatibility table Ronil was talking about earlier — and that's how we get transaction ID lock contention. Tuple lock contention is also about two different sessions, typically doing SELECT FOR UPDATE or SELECT FOR KEY SHARE. So this smells a lot like row lock contention. So we looked at the SQL statements running at the same time — the top SQL — and identified a new query that pretty much did not exist before.
And that smells like an application change. So we went and looked at pg_locks — pg_locks is where all the current locks are visible. We first scanned pg_stat_activity for the query, used the pid column to look up all the locks that session holds or is waiting on — you can use the granted column to filter down to the locks it's waiting on — and we identified an exclusive lock it was waiting for. We used the pg_blocking_pids() function to identify which other sessions were blocking our connection, and then we went and identified the locks held by those blocking sessions. And we found that there was indeed row lock contention. We investigated further, identified that it was happening because of an application change, and worked with the customer and the developer to fix it. But the recommendation here: PostgreSQL has a deadlock detection mechanism. What you need to make sure of is that deadlock_timeout is set; based on that, PostgreSQL will look for deadlocks and kill a session participating in the cyclic dependency of lock waits. You can also log lock wait events with log_lock_waits. As a good practice, make sure you're sampling pg_stat_activity and the other statistics views. When you sample pg_stat_activity, you're taking pictures; when you put them together, it's like building a motion picture of your database's history. Then you add more information by logging key events — slow-running SQL, checkpoints — and that's like providing the sound for your motion picture. That's how you can go back, look at what happened in your database, and debug performance issues. All right, the same QR code once again if you're interested in learning more about AWS — and that's a wrap. Thanks everyone, and thanks for being patient with us; we took a little longer than scheduled. Any questions for the speakers? There was a question about one of the slides — the one showing CPU utilization and blocked transactions over time — asking what is happening there. And a related struggle: when CPU utilization goes up and the sessions are increasing tremendously, how can we find out which process is exactly driving that increase by running a query? The hack I used was to just restart the server and immediately flush out the sessions, but there must be queries, right, where you can immediately find out which process is actually responsible? Yeah, so there are a couple of ways. One is, of course, you can log PostgreSQL queries — long-running queries — then go and look at their PID and look at the process history, if you're also taking snapshots of top. That's one way. There is an extension — I'm not able to recall the name, pg_proc... no, not pg_proc — there is an extension you can use with PostgreSQL, also supported in RDS and Aurora, which allows you to look at OS metrics from the database interface.
And it gives you the PID, so you can join it with pg_stat_activity and get the queries and their CPU utilization. The third way, which is a bit more intuitive: if you're taking samples of pg_stat_activity and the wait event is null for an active session, you can assume the query is likely on CPU — it's being executed, doing memory I/O through the CPU caches and getting work done there, so it's basically using CPU. If you're taking samples, you'll be able to see which query is mostly "waiting on" CPU, and that is your likely contender for the increase in CPU utilization. But sometimes it's not that obvious. That was the clear case — the query directly using CPU. There are also scenarios like the lock wait we talked about: the throughput has degraded completely, but the CPU utilization did not drop to 0% — if you look at it, the CPU utilization has come down, yet the throughput is almost completely gone. So where is the rest of the CPU being used? Postgres uses spinlocks, so the process is spinning, waiting for the lock to become available — it's also using CPU, just not directly. I haven't shown it here, but this is likely a scenario where you would see system CPU go up. So it's also important to capture CPU metrics — not just overall CPU utilization, but which CPU is going up: is it user CPU or system CPU? Okay, got it, thanks. Any further questions? We have two or three more minutes. Well then, let's thank the speakers for the presentation. Thanks, everyone. Is the next speaker around? Yes. So you have the advantage of two minutes to set up. The next talk is at 6 p.m., so this slot is one hour and forty-five minutes — probably with a coffee break in there. Normally it goes really fast: about 35 minutes for the talk, and then a short break; I will tell you about 10 minutes before the end. How do you pronounce your name? Julien Rouhaud? Rouhaud, yes. Okay. So the next presentation is about anonymization in Postgres, by Julien Rouhaud. Please welcome the speaker. We have about 35 minutes for the talk and then a short coffee break, which I will mention later on. So please go ahead. Hello — hi everybody. So today I'm going to talk about pg_anonymize, which is a new extension for transparent anonymization in Postgres. But the main topic of the talk is really to present a bit of the Postgres infrastructure for extensibility — how to add your own code running inside Postgres — and a bit of information on how it works and what the query workflow in Postgres looks like. Okay, so my name is Julien Rouhaud. I've been working with Postgres since 2008, I'm a major contributor, and I've been working as a developer and a DBA.
I'm also the author of HypoPG, PoWA and some other tools, and now I'm working at Nile, which is a company doing serverless Postgres for SaaS. So, today's agenda. First I will quickly introduce what pg_anonymize is, what its scope is and what problem it's supposed to solve. The main part, the second part, is how it works under the hood. And the last part, if I have enough time, is a bit of example usage, how you can set it up and see the anonymization working. So first a bit of introduction, in case you're not familiar with it. What is anonymization? The goal of anonymization is to remove, modify or degrade personally identifiable information, or PII, which is something very popular to do these days, especially in Europe and the US. The idea is that once you anonymize the information, you can't use it to identify one individual in particular, but you can still use the data one way or another. As an example, if you can identify someone by their birth date, you can truncate the date to the birth year or the birth decade. When you do that you can still do some statistics, say about people from 18 to 25 years old, but it's not enough information to identify someone. That's what anonymization is supposed to be used for. pg_anonymize is an extension, and its goal is to provide a way to perform that anonymization automatically and transparently. And its scope is only that: it only provides the infrastructure to anonymize the data, it doesn't provide functions telling you how to anonymize a name, a credit card number, a birthday or anything. You have a lot of libraries to do that; I recommend the Python Faker library, but there are plenty of others, and you can just use them through a SQL function or a PL/Python function. We will see that in the last part. The goals I had when working on this extension were mostly to be safe and robust: a user that is anonymized shouldn't have any way to escape the anonymization and see the real data. The DBA sets things up, and whatever you do, you have no way to see the real data. It's also supposed to be transparent: you have nothing to do to see the anonymized data, and any tool you work with will honestly anonymize the data. So it works with pg_dump, pg_sample or any tool that connects to Postgres, and there are no limitations. The only limitation I explicitly put in is to enforce read-only access, because in my opinion doing writes when you only see anonymized data is a good recipe for failure: if you're going to modify the data when you don't have the original, it's probably a bad idea. So now let's see how the extension works. The general idea is to intercept the query in the backend automatically. For instance you do a select star from person, and the extension will rewrite the query automatically into something that injects the anonymization. Here, for this very simple example, the extension rewrites the query as a select star from a subquery, and that subquery is what actually does the anonymization.
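To make that rewrite concrete, here is a minimal sketch of the kind of transformation being described, with a hypothetical person table and phone_number column; the exact masking expression is whatever the DBA configures via security labels, as shown later in the talk:

```sql
-- What the anonymized role sends:
SELECT * FROM person;

-- What the backend effectively plans and executes after pg_anonymize rewrites it
-- (hypothetical columns; the masking expression comes from the security label):
SELECT * FROM (
    SELECT id,
           name,
           '+886-XXX-XXX' AS phone_number  -- masked value replaces the real column
    FROM person
) AS person;
```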
So here I'm just doing a naive example of anonymizing a phone number, where I keep the country code, which is actually the Taiwan phone code, and X's for the rest of the number. So you roughly know where, but you don't know the actual phone number. If you execute that instead of the original query, you will see the data anonymized, and there is no way you can access the original data. So let's see how you can do that with Postgres. One thing you need to know about Postgres is that one very convenient way, as a developer, to execute your custom code is to use what Postgres calls hooks, which are points in the query execution and other parts of Postgres where Postgres will execute the code you give it, on top of or in place of the regular code. So when you load your module, you just register the hooks you want to use, and then when Postgres does some action, executing a query or something, if your hook is registered, it will call it and do whatever you want it to do. The only difficulty here is to find a suitable hook to do the query rewriting I was talking about. So here's the query workflow. It's not the whole truth, but it's very close to reality. On the client side, all you do is a select star from your anonymized table; everything else happens on the backend side. First you have the parser, which parses your raw SQL statement and produces zero, one or multiple structures called raw statements. Each of them then goes through the parse analysis stage, which outputs one Query per raw statement. That then goes to the rewriter, which is what is used when you have views or rules on any of your tables, and again the output of the rewriter is one or multiple Query objects, the same kind of object. Then each Query goes through the planner, and the planner gives you one plan per Query. The plan is what the executor actually executes to get your data, or whatever your query asked for. We will focus here on parse analysis, which is what happens right after the parser. It transforms a raw statement into a full Query, which contains all the information you need to fully identify all the objects in your query. For instance, if you do a select star, it will expand the star into something like select phone_number, name, full_name and so on. It will check all the data types you are using; it will check that the phone number is a text, things like that. So you have all the information you need to fully understand the query. And at the end of the parse analysis stage it calls a hook, the post_parse_analyze_hook, and that's the hook we can use in pg_anonymize to transform the query and inject the anonymization on the fly. For all the code I'm pointing at in this talk, I show a quick extract and the file where it's actually defined, so if you're interested you can look a bit more into it. There are several functions used for parse analysis; one of them is parse_analyze_fixed_params, which is used when you have a fixed set of parameters. It gets the raw statement as input, along with the original query text and a bunch of other parameters that are not really relevant here. You can see it calls transformTopLevelStmt, which transforms a select, delete, update or whatever into a Query, does a lot of other things, and eventually calls the post_parse_analyze_hook with the parse state, the Query and the jumble state.
The jumble state is what pg_stat_statements uses to compute the normalized query and generate a normalized query string. And then it returns the Query, after you have possibly modified it in the post_parse_analyze_hook. So what exactly is inside a Query? It's defined in parsenodes.h. It's a plain C structure that contains all the information the planner needs to create the plan, so it has a lot of fields. You have, for instance, the select part of a select statement, with all the columns; you have the qualifiers, which are the where clauses and the join clauses; and you also have, which is interesting for us, the list of tables, stored in the rtable field. So you see the Query structure: you have a command type, for instance select, and you have a List, which used to be a linked list in Postgres and is now an array, the rtable, which is the list of range table entries, which are basically the tables. So let's look at the range table entry. Yet another C struct. What's interesting to see here is that it defines not a table but a relation. In SQL, a relation is a set of rows. It could be a plain table, but if you join two tables, that's a join relation, which is also a set of rows, and in our case a subquery is also a relation. So we have our range table entry, which can be a plain table, an RTE_RELATION, and you can probably see where I'm going: what we basically have to do is transform a relation into a subquery inside the Query structure, and then we can inject the anonymization on the fly, automatically. There are a lot of fields in this structure, and which field is used differs for each RTE kind, but you can see that for a subquery we have a subquery field, and it's also a Query. So a Query can contain another Query, which for a subquery is the query that will actually be used. Another thing you have to know about relations is that in the Query structure you don't store, for instance, the table name or the column names; you only store offsets and object identifiers. So when you select phone_number from person, the Query does not contain the text "phone_number"; it contains something like field number one of some relation. And for the relation, you don't store the relation itself, you store an offset into the list of range table entries. So a reference can be, say, column one of relation two. That's important, because if we change the RTE from a plain relation to a subquery, it won't break anything: as long as we output the same number of fields with the same data types, it doesn't matter that the query underneath changed, it will still be column one of RTE two. It will still work the same, and it will just see new data. So all we have to do is change the normal relation into a subquery that does the anonymization. How can we do that? The idea, as I mentioned on the first slide, is to generate a subquery that does all of that. We generate the subquery by looking at the original table, looping through all the columns and generating a select of field one, field two, field three, replacing, say, field three with whatever expression outputs the anonymization, depending on how you configure it in the security labels; we'll see that a bit later. And I also added compatibility for inherited and partitioned tables.
So if you anonymize a root partitioned table and you select from one of the partitions, it automatically inherits the anonymization from the root level, so you don't have to declare it for every single partition. So, pg_anonymize: here is a short extract of the code. First I get all the security labels, which is a list of, for each column, the expression you want to use, if any, to anonymize the data. I start by emitting a SELECT, and then I loop over all the attribute numbers, which are column one, column two, column three of the relation. If I find a security label, I use its expression, aliased AS the exact same identifier; that's not strictly required, but it's easier to debug. And if I don't have a security label, I just output the original column. Then I finish with a FROM and the original table name. When this stage is done, I have a query text, which is the query I want to use inside the original Query. Once we have that, we just need to do exactly what Postgres was already doing: parse this query and run parse analysis to get the Query structure, and use it to replace what was already there. Obviously, when we call parse analysis we don't want to invoke pg_anonymize again, we don't want to anonymize already anonymized data, so there's a flag, "am I being called from inside pg_anonymize", and in that case I just don't recurse and explode. So once I have this query text, I call the parser and then parse analysis, which here is parse_analyze_fixed_params because I don't have any parameters. That's how I do it in the extension: I assume I have the SQL statement that was built before, I call pg_parse_query, which returns a list of raw statements, and since I know what I generated, I know there is only one, so I just take the first node. I call parse analysis, I have my Query, and then I just hack the RTE and say: now it's not a relation RTE, it's a subquery RTE, and the subquery is the Query I just built. In reality there is a bit more code, we clean things up a bit, but that's basically it. When that's done my work is finished: I've injected the subquery with the anonymization, and I just return and let Postgres keep doing its job. Postgres will plan it and execute it, and the user will see the anonymized data without having done anything special. It's all transparent, and there's nothing the user can do to escape it. And since everything works with column numbers and RTE offsets, nothing will break, because we still refer to the third column of RTE two and so on; so it's actually very safe.
So all of that works well for DML and DQL statements, selecting data and so on. Now, if you do a COPY, for example if you use pg_dump, it does not go through the path I showed you, with parse analysis and everything; that is bypassed, because for a COPY you don't need to generate a full query plan. It goes through another code path called ProcessUtility, and fortunately there is also a hook available there, the ProcessUtility hook, so I can use exactly the same technique: if I see a COPY something TO, which is exporting your data, or a COPY (query) TO, I do exactly the same thing and instead generate on the fly a COPY of your anonymized query TO. This way any utility statement will see anonymized data, so pg_dump will dump anonymized data. Okay, this may all be a bit abstract for now, so I have some slides showing how it actually works and how you configure it; hopefully that will make it more real. It uses, as I mentioned before, a declarative approach based on security labels. Security labels are a feature in Postgres where you can attach an arbitrary string, whatever you want, to an object. For pg_anonymize we use two kinds of objects: we use columns, to attach whatever expression you want to use to emit the anonymized value, and we use roles, so you can say which role is supposed to be anonymized and which is not. This way, if you connect as an anonymized role, you will see anonymized data, and that's it, you can't escape it. The security labels are entirely handled by the custom module, so it needs to be loaded before you can use them; you can't declare a security label for a module that does not exist. This is important: you need to make sure the extension is loaded first. So the basic steps are: make sure pg_anonymize is loaded; declare your anonymization rules using security labels, which are basically just expressions; declare which roles should see anonymized data; and then configure the role to load the extension automatically. You could load it for everyone: if the extension is loaded and your role is not anonymized, it will do nothing, although it will still call an extra hook, which is never free performance-wise, but apart from that it won't do anything. So the first step is to make sure pg_anonymize is loaded; in this case I use the LOAD command because it's very convenient. It requires superuser privileges, but that's only for the initial setup, and to use security labels you need to be superuser anyway. Once it's loaded, you can declare a security label for pg_anonymize. Here is the rule I've been using since the beginning of the talk. SECURITY LABEL is actually a utility statement, and the syntax is SECURITY LABEL FOR your module, in my case pg_anonymize, ON the object you want to label, in my case a column, person.phone_number, IS the expression I want to output. Here I output a hardcoded +886 prefix, which is very bad anonymization since it assumes everyone has a Taiwanese number, hopefully not the case; it's just a very simple example showing that something different from the original data is returned. A recommendation is to use dollar quoting, so you can use single quotes inside the expression; that's easier than escaping quotes everywhere.
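A minimal sketch of that setup, assuming a person table with a phone_number column; the label provider name and syntax are as I recall them for pg_anonymize, so treat this as illustrative rather than authoritative:

```sql
-- Superuser, initial setup only: make the module available so labels can reference it.
LOAD 'pg_anonymize';

-- Attach the masking expression to the column. Dollar quoting ($$...$$) avoids
-- escaping the single quotes, and the explicit ::text cast keeps the declared type.
SECURITY LABEL FOR pg_anonymize
    ON COLUMN public.person.phone_number
    IS $$'+886-XXX-XXX'::text$$;
```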
An extra security measure is to make sure you return the exact same data type, because parse analysis has already checked that, say, phone_number is a text. If you change your expression and it now returns, say, an integer, everything will break, because the executor will try to emit an integer where a text is expected, or the other way around, and usually that will not work. So you have to use the exact same data type. That's why here you see me explicitly casting my expression to text, even though it's already a text; and if you don't, pg_anonymize will check it for you and raise an error if it's not the expected data type. You need to do that for every single column of every single relation you want to anonymize. When that's done, you go to the next stage, which is declaring a specific user as anonymized. It's also a security label: you just say, on role alice, that she is now anonymized, and every connection opened after that with the alice role will only see anonymized data. The final step is to modify session_preload_libraries so that the extension is loaded automatically: you do ALTER ROLE and set session_preload_libraries to pg_anonymize, plus whatever else you want, and now alice is automatically anonymized. She doesn't ask for it when she connects; you, as the DBA, decide that she will see anonymized data. Also be careful: this is applied when the connection is opened, so if alice already has open connections, those connections will still see the original data; you may want to kill her existing connections to make sure she only sees anonymized data, for security. And now let's use it. I run exactly the commands I showed before, and if I connect as the postgres user and do select star from public.person, I see the real row with the real phone number; and if I connect as alice, I get the anonymized data instead, without doing anything apart from connecting as an anonymized role. It works with any client, any library, any tool, dynamic SQL, pg_dump; and likewise, if alice calls a stored function that itself runs SQL, she will also always see the anonymized data. That's the main goal of the extension: to be very, very safe and guarantee that the user only ever sees the anonymized data.
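Sketching those last two steps, with the alice role from the demo; the label value and the GUC name are as I recall them for this extension, so verify against its documentation:

```sql
-- Mark the role as anonymized: new connections made as alice only ever see masked data.
SECURITY LABEL FOR pg_anonymize ON ROLE alice IS 'anonymize';

-- Load the extension automatically for alice's sessions.
ALTER ROLE alice SET session_preload_libraries = 'pg_anonymize';

-- From a fresh session as alice, masked values come back transparently:
-- SELECT phone_number FROM public.person;   -->  +886-XXX-XXX
```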
So, what is already handled in the extension. All the general infrastructure I showed you is stable and works as expected. Inheritance is handled, for partitioned and inherited tables, with the possibility to override the rule on a partition itself if you want different anonymization on different partitions. There are a lot of sanity checks: it makes sure you use the correct data type, it makes sure your anonymization expression doesn't try to inject anything (when the expression is parsed, one parse tree means a single query, and multiple parse trees mean there is an injection, so it is rejected), and it makes sure everything stays read-only. The next things I'm planning to add are more sanity checks, and in particular runtime sanity checks, because if you do an ALTER TABLE and change a data type, you don't want the anonymization to break everything, so the checks have to happen at runtime too; making every rule check a bit more thorough; and handling rules and views, which are not fully handled for now, because the work is done in the post_parse_analyze_hook, which doesn't see what is done after the rewriter. For that we just need to do the same thing in the planner hook instead of the post_parse_analyze_hook, but other than that it's the same approach. So if you're interested in the extension, you have the link here and my contacts are here; if you have any questions about the extension or about extensibility in Postgres, I'd be happy to answer. Thank you very much. I have a question. Any questions? Does the optimizer know about the mask you introduced for the phone number? I mean, if it plans a query with groupings, will it keep in mind that you added a lot of duplicates? Yes, the optimizer will know, because it's the same as if you executed the rewritten query yourself. So if you use a fixed expression for the phone number and you filter on it, Postgres should be smart enough to see that either all the rows match or none of them do, and that it can't use an index, so obviously it has a huge impact on performance. Okay, and a second question: can we speed up our queries using your anonymizer? I mean, if we don't need, for example, the family name, we could just drop it from the query by using your template, improving selectivity in some way, reducing the number of clauses in the expression. You mean something like a push-down of where clauses? Reducing the number of clauses and expressions by flattening the query through your anonymizer, and speeding things up that way. It could work, but it adds many ways in which you could actually break the anonymization, so for now I'm focusing on the safety part. There is no way to use an index on that; if you only want select phone_number from your table, the generated subquery will still return all the columns, which is not great, but it's safe, it won't crash. That's my main focus. There are a lot of things that could be done to improve it, but I don't think trying to run an OLTP workload with this enabled really makes sense; it's more intended for generating a report or some statistics, for which performance is not really the biggest concern. But if users complain it's too slow, I'll be happy to look at it; it's just a lot of work to make it much more performant. Just one question: when you have this module loaded, do I understand correctly that by default nothing is anonymized for anyone unless you specifically say so, as for alice? So for alice it is anonymized, but for other users it is open? Say I have this module loaded and for alice I say anonymize the results; for other users it is open, they will see everything? Yes, they will see everything. I'm wondering, shouldn't it be the opposite: anonymized for everybody unless you specifically say, for this user, don't anonymize? You could do it that way; for now I chose opt-in instead of opt-out, but if that's something you would like, I can definitely make it work one way or the other, that wouldn't be hard to do. Okay, and one more thing: this extra layer of parsing, rewriting the query, does it have any performance impact? Yes, it has a real performance impact, because the query rewriting is done every time: even with a cached or prepared statement it is still done every time, so it is quite impactful. So how much is it?
It really depends, because if you have a query that returns one row but you have a thousand columns, it will be insanely expensive; if you have a billion rows and two columns, it will be essentially free. So it's a question of how much data you have versus how expensive the anonymization itself is; it really depends on your data. Thank you. Any further questions? Hi Julien. Are you thinking this anonymization can be used for avoiding GDPR? Sorry? For complying with GDPR. Yes, that's one way to use it. As I said, it provides the infrastructure for it, though I think GDPR is also about not storing the data, not only about not outputting it. Do you also check the GDPR rules? Is there anything it can answer for you there? I don't know all the GDPR rules, but you would have to implement them yourself. Like I said, it only gives you a way to apply them: if a rule says, for instance, that a credit card number should only show the last three digits, you have to write that yourself, with a substring or something. That's not the hard part, it's usually easy to do, and there are libraries for it. The extension just gives you a way to make sure it will happen and that you won't see non-anonymized data. Thank you. Any further questions? What about the performance of the plugin when it's enabled, how much will memory usage increase? Oh, it won't be that much more expensive, because the query is still planned only once: instead of FROM person, it plans the select from the subquery, so it is a bit more expensive, but it's still a single planner execution. The generated inner query isn't planned separately on its own, it's just parsed a second time, and parsing is a bit expensive, but not as much as execution when you have a lot of rows. And usually you don't want to use anonymization for an OLTP workload; it should be used for specific users, when they want to do some statistics or data-warehouse kinds of queries, and for that kind of workload the planning stage doesn't matter much, most of the time and memory goes into execution, not planning. Thank you. Well, let's thank the speaker again. Thank you very much. As far as I can see we have a break till 6... no, till 4, till 4, I'm in the wrong time zone, the Japan time zone. Till 4 p.m.
So yes, there is coffee, I guess, and enough time to discuss and ask questions, so see you at 4. And thanks for your time; the previous sessions were continuous and you just came in, so with a fresh mind: we are going to talk about PostgreSQL database size management. The storage architecture in PostgreSQL is a bit different from any other database, so our focus will mainly be on the vacuuming processes. In this discussion we will go level 100, level 200, and in some places level 300; we will talk about a couple of parameters and how to tune things. So let's see the agenda: we will look at an architecture view, MVCC, the vacuum process and autovacuum, because these are unique to PostgreSQL, so we thought this is something we can share where you will learn something. On architecture, this is a conceptual view of PostgreSQL: any database always has some storage-level files, and there are associated memory structures and background processes. Among the background processes in PostgreSQL we will mainly be focused on the autovacuum part; this is quite interesting about PostgreSQL. Now let's see MVCC, which is where the actual discussion starts in PostgreSQL. MVCC stands for multi-version concurrency control. This is unlike other databases such as Oracle: in Oracle you have an undo tablespace where you store the previous images of the data, so when you are updating or deleting data, the old image is stored in a separate, dedicated area called the undo area, and in case you roll back, that image comes back. In PostgreSQL that does not happen in a separate area; it happens in the same place where your table, the relation, is stored. Each transaction in PostgreSQL is given a 32-bit number; 32 bits is the size of a transaction ID in PostgreSQL, so it can represent up to about 4 billion transactions for the database. The basic idea of multi-version concurrency control is that readers should not block writers and writers should not block readers: users concurrently connected to the database should all be able to work on the same data at the same time, and that's the purpose of MVCC. So this storage of previous images of changed data happens in the same file where the relation is stored, which is unlike most other relational database management systems. When you insert a new record in PostgreSQL, the tuple records the transaction ID of the transaction that inserted it. But when you update or delete data in a table, PostgreSQL doesn't actually remove that tuple from the relation; it creates a new tuple, so the old image of the tuple remains in the same table, and a new row version is created on behalf of the update or delete statement. An insert is quite simple, it won't waste space; but updates and deletes create a new record while still holding the old image, the previous version of the record, and that is for the purpose of multi-version concurrency control: maybe other transactions are still using the same data, so they can keep going as usual while the new transaction has created a new image of the data. But if these old images within the table keep growing, that space is wasted, and queries on such tables unnecessarily scan the pages holding those old images.
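A small sketch of how those old row versions show up in practice, using a hypothetical table t; the dead tuple count is visible in pg_stat_user_tables:

```sql
-- Hypothetical demo: every UPDATE leaves a dead row version behind.
CREATE TABLE t (id int PRIMARY KEY, val text);
INSERT INTO t SELECT g, 'x' FROM generate_series(1, 100000) g;
UPDATE t SET val = 'y';   -- rewrites every row, leaving ~100k dead tuples

-- The statistics views expose the damage:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 't';
```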
That means deleted tuples. You can think of it like a kids' party at home: before the party the house was perfectly tidy, but after the party everything gets messed up, right? So we need to clean those things up. What's the solution? At home, the simple way is I can keep a robot and let the house get cleaned up automatically. In PostgreSQL, when this happens, when a table has a very large number of dead tuples, the way to clean them up is vacuum, just like I do at home: vacuuming means cleaning up unnecessary things. So in PostgreSQL we clean up dead tuples, and dead tuples are older versions of tuples that are no longer required by any active transaction. Vacuum itself is a command, VACUUM, and the vacuum command does various things. The first thing is to recover, or make reusable, the space occupied by updated or deleted tuples. The second thing is that it updates the statistics of the tables. It also updates the visibility map, which matters for index-only scans: if you have created an index on a couple of columns and all the columns a query uses are part of that index, then all the data can be fetched from the index alone, without going to the table; the visibility map, which records which heap pages are all-visible, is what makes that possible, and it is maintained by the vacuum process, which is very important. There is also a limitation in PostgreSQL, where it can only distinguish around 4 billion transaction IDs, so to avoid the transaction ID wraparound problem the vacuum process freezes old transaction IDs. It also updates the free space map along with the visibility map, and as usual it keeps various statistics in the pg_stat_* views.
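A quick sketch of invoking those behaviors by hand on the hypothetical table from above:

```sql
-- Reclaim dead tuples and refresh planner statistics in one pass
-- (VERBOSE reports removed row versions and pages skipped via the visibility map):
VACUUM (VERBOSE, ANALYZE) t;

-- Aggressively freeze old transaction IDs to push back wraparound:
VACUUM (FREEZE) t;
```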
These are dynamic views that store relation-related statistics for your tables. Now, how does it work? Just remember this slide, I'm going to use the same slide in the next two or three slides. How does vacuum run, in the normal case? First, vacuum scans the heap; in PostgreSQL the heap is the storage that holds the actual table and its data. So first the vacuum process gets ready to start: in PostgreSQL it's the autovacuum launcher that starts the vacuum worker processes. Then it actually scans the heap, the pages where the dead tuples are, and does the actual cleaning, the vacuuming of the unnecessary things, meaning the dead tuples, from the table. At the end it also does index cleanup, and if there are cleaned-up pages at the very end of the table, it frees those pages as well. So it's quite simple: initialize, scan, vacuum, and clean up. I'm going to use this slide again shortly. That was vacuum, where someone has to come and clean up the dead tuples; but I want to automate all of this, and that's where autovacuum comes in. The autovacuum daemon automates vacuum and, subsequently, analyze. Because I can't keep running things manually, I need a mechanism that runs my vacuum process frequently on the tables that need it, and autovacuum does all of that for you. It also freezes tuples to protect against transaction ID wraparound. It is controlled with the autovacuum parameter, which is very important; you can set it to off, but even then it will still run when needed to protect you from transaction ID wraparound, that is the last thing it will always do. We thought we'd simplify the difference between vacuum and autovacuum, so we created pseudo code for how the autovacuum process actually runs. Assume there are various tables within a database, and for each table, in a loop, the pseudo code works like this. If autovacuum_max_workers processes are all already running (there is a limit on how many autovacuum workers can run in a database), it won't do anything; it waits until at least one becomes free. Otherwise, it checks whether any table's transaction ID age is reaching autovacuum_freeze_max_age; if that's the case, it acquires a SHARE UPDATE EXCLUSIVE lock on the table and runs a vacuum on it in aggressive mode. That is the order of priority in which it executes. If the first two are not the case, meaning free workers are available and there is no wraparound danger, the third check is whether a table exists whose dead rows exceed autovacuum_vacuum_scale_factor times the number of tuples plus autovacuum_vacuum_threshold; if so, it acquires a SHARE UPDATE EXCLUSIVE lock on the target table and, once it has the lock, runs the same vacuum steps we saw on the previous slide. So the difference between autovacuum and vacuum is simply that autovacuum controls when the vacuum has to execute, and that's it. Now let's look at a couple of scenarios. Assume you have a table with a lot of writes, meaning a high number of updates and deletes; what can happen is that you will have a large number of dead rows.
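That trigger condition from the pseudo code can be expressed as a query; a sketch using the stock default values of autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2:

```sql
-- Tables whose dead tuples already exceed the autovacuum trigger point
-- (threshold + scale_factor * reltuples), assuming the default settings:
SELECT s.relname,
       s.n_dead_tup,
       50 + 0.2 * c.reltuples AS autovacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples
ORDER BY s.n_dead_tup DESC;
```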
So you would like to vacuum those tables more frequently. To do that you can decrease autovacuum_vacuum_cost_delay, so the vacuum work is throttled less and those tables get cleaned up more often. You can also have large tables with infrequent updates, meaning the table size is quite big but there aren't many updates; in that case autovacuum can waste resources on the system where PostgreSQL is running, and if you want to reduce the impact of autovacuum you can increase autovacuum_vacuum_cost_delay. This parameter is in milliseconds by default if you don't specify a unit; for example autovacuum_vacuum_cost_delay = 10 means the vacuum process pauses for 10 milliseconds each time it exhausts its cost budget before continuing. If there are tables that are critical for performance, meaning your application is very sensitive to a few tables, you can set table-specific parameters: these autovacuum parameters can also be set at the table level for those performance-sensitive tables. Also very important: vacuum is not free of cost. When the vacuum process runs it consumes resources, I/O and CPU, on the system where it runs, so you need to make sure your autovacuum is not consuming too many resources on that system. For that, if you are running PostgreSQL on an AWS-based system, you can use an AWS Lambda function to schedule vacuum jobs in off-peak hours, or you can use cron jobs to vacuum your tables, and adjust the autovacuum_vacuum_cost_delay parameter. Monitoring is very important, because autovacuum can slow down your overall system, so you need to regularly monitor pg_stat_user_tables and pg_stat_activity to see what's happening. Also, if you are using Amazon RDS for PostgreSQL or Aurora, MaximumUsedTransactionIDs is a CloudWatch metric you need to monitor: by default the freeze age limit is 200 million, so if you set an alarm you can be notified when it goes above 200 million, and autovacuum will do the wraparound vacuuming for those tables. You may also be interested in how much time the vacuum process is taking for each table; the log_autovacuum_min_duration parameter is the very important one there. If you are making changes to these parameters, change them slowly, don't make one big jump, and test first on a non-production system. For troubleshooting, problems with autovacuum usually happen when your application workload is uneven, say it's very heavy during the day or a batch job arrives at night. First check whether your autovacuum thresholds are actually being reached. Then check the autovacuum parameter; on RDS for PostgreSQL or Aurora it lives in the parameter group, and if it is set to off, the autovacuum process will not run. Maybe autovacuum is disabled at the table level. Or there is a bottleneck in the autovacuum process itself; maybe the system cannot provide enough throughput at the storage level or at the CPU level, so you need to check what sort of bottleneck it is experiencing. And if there are open long-running transactions on a relation, autovacuum will not be able to do its full job on that table.
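A sketch of the table-level overrides just mentioned, with illustrative values on a hypothetical write-heavy orders table:

```sql
-- Per-table overrides for a write-heavy, performance-sensitive table:
-- vacuum it sooner and throttle it less than the cluster-wide defaults.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02,  -- trigger at ~2% dead tuples instead of 20%
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_delay   = 2      -- milliseconds; less throttling
);
```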
It will still run on that table, but at the very least it will skip the tuples that are still needed by an open transaction. The key points, and why I am mentioning them: the architecture of PostgreSQL is different. Updates and deletes cause bloat, and bloat means unnecessary storage wastage, something that needs to be cleaned up to keep query performance healthy; bloat always slows down queries if it becomes too big compared to the overall table and its data size. Tune autovacuum so that the vacuum process can run automatically, and remember that autovacuum follows the rules we saw in the pseudo code; there are a few more things behind the scenes, but at a high level that's how it runs. Autovacuum must be customized for different types of workload, this is important, and autovacuum requires monitoring for any potential issues: if you are seeing a sudden autovacuum-related issue, it needs to be identified and rectified. That's it, thank you. Thanks for the presentation; any questions? The question is around bloat and size: is there any chance that using this utility will also help on the cost side, since cost relates to the size of the table? Could you repeat your question? Okay, the question is: if I use autovacuum, it will clear up my space as well, right, so it will automatically help reduce cost? Yes, it will reduce your overall size; autovacuum helps with that too. Thanks, thanks for the insightful session. Thank you. Another question? Just to repeat the key point: vacuum is specific to PostgreSQL, unlike other databases, so be mindful that if you are deleting or updating, you have to clean those old dead tuples from the table. How do you clean them? Vacuum. It's just like a kids' party at home, you can imagine the state of the house; once the party is over you clean the whole house, right? It's exactly the same in PostgreSQL, but you need to automate it, so vacuum, and then autovacuum on top of that. Thank you for your talk, and I have a question: what is the difference between Postgres MVCC and MySQL MVCC, and what are the advantages and disadvantages of Postgres MVCC? Okay, so if I understand your question correctly, you mean the difference between MySQL MVCC and PostgreSQL MVCC, and the advantages and disadvantages of Postgres MVCC. The very basic difference is where old records are held: PostgreSQL holds the old record versions in the same table, even though they are logically deleted, while MySQL holds those old records in an undo area, a separate storage location within the database. That's the basic difference. When I hold my old, deleted records in the same table, it becomes a big problem if the number of deleted records in that table grows too large, because it slows down query performance; that is the disadvantage. From the advantage point of view, I'm not writing the old record information to an undo area, so I'm saving that I/O in PostgreSQL, whereas MySQL writes those old records to a separate area to maintain concurrency. So you can get some benefit in terms of I/O when deleting or updating the data.
So that is the basic difference between the two. And the advantages and disadvantages of Postgres MVCC? As far as my understanding goes, in PostgreSQL, if you are deleting data, the I/O will be slightly less compared to MySQL; that is the advantage, and that benefit will be directly visible in your queries, meaning a delete or update may be faster in PostgreSQL compared to MySQL because the immediate I/O is less. Thank you. Okay, thanks everyone for your attention, and let's thank the speaker again. One, two. Today I would like to talk about extended statistics and our new extension, which we designed, free to use, to match the discussion in the community about statistics and how to improve them. Although it was introduced back in Postgres 12, extended statistics are still used quite rarely in real installations, first because people are afraid of the overhead, since they don't know how costly it will be during ANALYZE, and second because they don't know on which set of expressions or columns to create extended statistics. Our solution tries to resolve both of these problems in some way. I have been working on Postgres since 2017 as a contributor to the core, and I designed some features which help the optimizer be smarter: an adaptive query optimizer based on machine learning, a plan freezer, replanning, the OR-to-ANY transformation that was committed recently, and so on. Most of my features were designed according to a query-driven approach, based on queries, and having watched for some time how that works, I see that it has weak points because of the architecture of Postgres. This time we tried to implement our new extension according to a data-driven approach. Maybe it makes sense, so we are all on the same page, to say something about how extended statistics work; you can easily find a much more detailed explanation in Tomas Vondra's talk, you can follow the link in the QR code. In simple words, with CREATE STATISTICS you pick a table and a set of columns on which you create the statistics, and after that extended statistics will try to find dependencies between those columns of the data. Unfortunately we can create statistics only on one table, which is the reason why it doesn't help with joins or more complex structures, just with single scans. Right now extended statistics contain three types of statistics. First, most common values: values which you find frequently in the table. Second, the number of distinct values, which is calculated for each possible combination of the columns used in the statistics. And third, dependencies, which tries to identify functional dependencies between separate columns or combinations of columns. Let's try to understand how difficult it is to calculate extended statistics. MCV, most common values, is just two arrays, the values and their frequencies in the table, so increasing the number of columns just increases the size of each value; maybe not so difficult. The distinct count is just one number, but it is calculated for each combination of columns, and that's a lot of combinations: for example, for eight columns we have around 250 different combinations. With dependency statistics it's also one value per combination, but for eight columns we already need to calculate about one thousand different combinations; that is definitely too much. So in my opinion we should differentiate: ANALYZE calculates the plain statistics on each column and the extended statistics, and we should at least give the database administrator a chance to decide what to do, so you don't get a lot of overhead.
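As a reminder of the core feature the extension builds on, a minimal CREATE STATISTICS example on two correlated columns, with hypothetical table and column names:

```sql
-- Plain per-column statistics cannot see that these two columns move together.
CREATE TABLE plants (country text, country_long text, fuel text);

-- Extended statistics on the pair: MCV list, ndistinct counts and functional dependencies.
CREATE STATISTICS plants_country_stats (mcv, ndistinct, dependencies)
    ON country, country_long FROM plants;

ANALYZE plants;  -- extended statistics are only computed during ANALYZE
```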
Now, this extension, how does it work? Just to show how it works, you can see in this example: you load the extension, and after that, whenever you add a table and indexes on that table, you immediately get new extended statistics in this database. These statistics are defined over the definitions of the indexes. Of course, as you can imagine, people like indexes, and they can create a lot of indexes on one table, so we have to think about compacting them, and we try to address that problem: right now we just remove duplicated statistics, but we have room for improvement here. We also limit the number of columns in the definition, because, as you saw before, if someone defines an index on 10 to 15 columns we can run into the computation problem, so we limit it and added a parameter for it. To manage the problem of accidentally removing needed statistics, we added a dependency of each auto-generated statistics object on the extension and on the index: if you delete the index, you will immediately see that the statistics based on this index are removed from the table; if you delete all indexes, you delete all auto-generated statistics; and if you delete the whole extension, you immediately delete all auto-generated statistics in the database. We also add a comment to each auto-generated statistics object, so a database administrator can write scripts against this format to check how many statistics they have and work out problems. We also added a couple of GUCs and functions with the extension: for example, a mode for how aggressively statistics are created in the database; a limit on the number of columns, in case you want, say, 10 columns and are willing to deal with the computational cost; you can generate statistics on a specific table, or on a whole schema, or on all tables in the database except the system catalogs; and you can remove some statistics on a table or a subset of tables. A short stage dive into the internal implementation: during this work we found out that it's not easy, not natural, for Postgres to create an object during the creation of another object, and because of that we did it in two steps. In the first step we use the object access hook and gather candidate object IDs which can be used to generate statistics, because at this stage we can't differentiate tables, indexes, create views and so on. In the second step, after successful execution of the ProcessUtility hook, we can already differentiate tables, types of tables, indexes and so on, and we know their names; at that stage we create the statistics, set the dependency on the extension and on the index, and add the description for those statistics. And why did we start this work? It was started because of a growing flow of complex queries, queries which contain a lot of clauses, a lot of joins and so on, driven by object-relational mapping systems and REST APIs. That number is growing, as far as I can see from our experience, and I guess they will manage bigger and bigger databases, so we should understand how to work with this challenge, how to work with queries which can contain quite arbitrary sets of expressions. Remember that Postgres was built on the assumption that people manually optimize their queries and keep the number of expressions minimal. Analyzing such complex queries, we divided them into two classes: base relations, the scans, and other relational operations such as joins, groups, sorts and so on.
For the top level, the complex relational operations, we already have a lot of tools: the adaptive query optimizer, the replanning feature, the plan-freezing feature, selectivity estimation from indexes, a lot of machinery to help the optimizer plan the query correctly. But for the low level, the simple scans, we can't use these complex helpers, simply because they add too much overhead for a plain scan; here we only have statistics, and we have to work with that. To show you how problematic it can be, here is a simple example. I took a freely accessible database, the global power plant database, which contains only one table with a lot of columns and a lot of information, and here we have just a simple query, a simple selection. As you can see, if we select by the short name of the country, we get an accurate prediction of the number of resulting rows from the table. But if we add to this expression the long name of the country, the description of the country, we get an underestimation, which is a very bad situation; in the case of a join we could end up in a nested loop scan and so on. That's simply because plain statistics don't know anything about correlations between columns, and of course there is a strong relation between the short name of a country and its long name, so we run straight into the problem. Our conjecture was this: databases have all the data they manage, and they also see all the incoming queries, so future generations of databases should somehow predict, should look into the correlations between the data, should find templates in the queries, and, based on that, maybe they should invent some new types of statistics, who knows. Why do we use indexes as a template to generate extended statistics? Here is a real example of a table named parcels with 8 indexes on it, from a real database of a real user. As you can see, people create their indexes keeping in mind the queries they will run against them, so it's a kind of template, and if we create statistics based on this template, it is quite probable that we will resolve the estimation problems for most queries, at least the most crucial queries in this database. So we tried to use that, and we ran an experiment on the same global power plant database, comparing plain statistics, extended statistics and the actual number of rows. We added different expressions to the WHERE clause step by step, and you can see how the cardinality prediction for this select changes. Plain statistics make somewhat strange predictions, because they know nothing about correlations, sometimes overestimating the real result and sometimes underestimating it. Extended statistics, however, make quite accurate predictions all the way, except for the last expression, because that last expression is an inequality, and extended statistics are built in a way that they cannot be used with inequalities, which is a problem. And with inequalities, what usually helps is histograms: we use histograms when we need to estimate an inequality. In general, ordinary histograms can't work properly here, because we don't know anything about the ordering of the clauses; but in our case we base our solution on indexes, so we know which columns will be used in the queries we need to estimate, and in which order, which matters even more. In this case we can use the histograms already implemented in the Postgres core, and we tried it with a simple patch; it's no longer an extension, just something for experiments. And how do we estimate the selectivity? We look into the clause expression and try to find the bins in the histogram that can contain this data.
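A sketch of the misestimation being described, using hypothetical column names for the power plant table (country for the short code, country_long for the full name):

```sql
-- Filtering on one column: the per-column statistics give an accurate estimate.
EXPLAIN SELECT * FROM plants WHERE country = 'TWN';

-- Filtering on two strongly correlated columns: plain statistics multiply
-- the two selectivities and underestimate badly; with the extended statistics
-- created earlier in place, the estimate stays close to the actual row count.
EXPLAIN SELECT * FROM plants
WHERE country = 'TWN' AND country_long = 'Taiwan';
```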
We look for the left boundary and the right boundary and see how much can fall in between. Sometimes it overestimates our data, but it definitely does not underestimate, and that is already a good result. We ran the experiment again after implementing this histogram support in the extension, on the same database and with the same clauses, and as you can see the histogram follows the extension's estimation quite closely and allows us to make a good prediction on the last step, where we use the inequality; it works better than extended statistics do right now. Of course, such a good result is partly due to the specifics of the experiment: the database has a lot of rows, but each column has relatively few distinct values, so the histogram covers the data quite precisely. But it does show that we can work with inequalities and add a new case that helps extended statistics; maybe here we need to add histograms into the core as a part of extended statistics. And that's what I wanted to talk about today, so if you have any questions you can ask them; if not, you can just follow the links: one link is to the extension, which we built for free use, use it on your own, and the second one is just the company where we work and implemented it. That's it, thank you. Any questions, comments, remarks? I have a question related to the histogram: is the histogram calculated automatically when you set up the extension, or do you need to specify all the columns where indexes have been created? You can't scan an index on its second column if you don't have the first column, you just can't use it that way, so we follow the index definition. You may have a situation where you have billions of records in a table and the data is not uniformly distributed over the values, it's highly skewed; is there an option to specify sampling for the histogram calculation, otherwise it would have to go through all the billions of rows? You mean skewed data? On skewed data, is there an option to say what percentage of the data it should consider to calculate the histogram, or does the histogram calculation go over the whole data set, which would mean going over billions of records? When you execute ANALYZE, you use the sampled data for building all of the statistics, of course, and as I remember there is a theorem on that issue which states how much of the data we should see to generate good statistics. Okay, thank you. Any further questions? We still have a bit of time. Well, if not, then let's thank the speaker again. And the tool that we are going to cover is pgvector. Before we get started, a little bit about me: my name is Sima Tariq and I come from Pakistan. My professional life started back in 2018 when I joined 2ndQuadrant as an automation engineer; later, in 2020, 2ndQuadrant was acquired by EnterpriseDB and I continued my career as a DevOps engineer at EnterpriseDB. I recently joined a startup named Stromatics, and of course I'm a developer and technical content writer; at Stromatics we are keen to provide professional services for Postgres. The agenda for today's session: first of all we will see what pgvector is and why we actually need it; we are also going to learn about different terms that are used in machine learning, like vector data, embeddings and contextual search; we will see the workflow of pgvector; we will explore indexing in Postgres; and we will conclude our talk by mentioning a couple of limitations and some real-world use cases of pgvector. So let's get started. First of all, what is pgvector? How many of you have ever heard the name pgvector before? Okay, a couple of hands, great. For those who don't know, pgvector is just an extension for Postgres.
If you are wondering what an extension is, an extension is simply a piece of software that provides additional functionality inside the database, inside Postgres. Now the question is why do we actually need pgvector: by default Postgres does not provide functionality to work with vector data. If you are unfamiliar with vector data, don't worry, we will get to it in the next slides, but for now you can think of it this way: you have your original data in the shape of images, text, audio or video, you use different models to convert that information into a mathematical representation, and that is your vector data; and of course if you want to work with vector data in Postgres, you need pgvector. So what is vector data? As I mentioned earlier, it is a numeric representation of the original data. Your original data can be text, images, audio or video, and the information gets converted into huge multidimensional arrays of numbers. In day-to-day life we usually work with two- or three-dimensional data; examples would be maps and motion tracking, where a temperature map can be two dimensions and motion tracking uses X, Y and Z coordinates. But multidimensional vector data can have 10,000 or even more than 50,000 dimensions. An example of such data could be an e-commerce website that is selling a t-shirt: that t-shirt can have hundreds of dimensions, like the height of the t-shirt, the width, the size, the brand name, the color scheme, and so on. One thing is for sure: the higher the number of dimensions, the better the understanding of the original data. So there is a trade-off between using more dimensions and keeping things cheap: if you have a higher number of dimensions in your data, you have more data points to compare, and you can eventually get better results, but the time spent performing a search operation will be huge; additionally you get more complexity, you need more computational power, and the time to generate the embeddings will be huge. Using fewer dimensions, on the other hand, can mean less accurate results. So it's always better to find the right balance between getting more accurate results and keeping things efficient: if you need more accuracy, use more dimensions; if speed matters more, you can maybe use fewer. Let's now talk about embeddings. Embeddings are a specific type of vector data: we give a machine learning model the original data, and as part of the training process it tries to understand the key characteristics of that data. If the original data is in the form of text, the machine learning model will try to understand the important words the original data contains; if the original data is in the form of images, it will try to understand shapes, colors, texture or objects in that data. Okay, let me quickly switch my screen and go to my terminal to see how real-world vector data looks. Over here I have a pretty simple Python file, and I'm using an embedding model provided by OpenAI named text-embedding-ada-002. My original data is this sentence here, something about 4.5 billion years; I say, okay, this is my original data, give me the embeddings for this data.
I'll just quickly run this and, boom, this is your actual embedded data. All the numbers you see on your screen are your actual embeddings. If I scroll a little bit up, you can see the array of floating-point numbers; it starts right there and goes all the way down to here. You can just copy this array and paste it into your Postgres table, and that means you have your embedded data and you can of course now perform similarity searches, which we will cover in the next slides.

Back to the slides. One more thing here: you can have similar-looking embeddings as well. For example, we as human beings know that a cat and a lion are both animals: a cat has two eyes, a cat can walk, a lion can walk, a cat has four legs, a lion has four legs. So the embeddings for cat and lion can be pretty similar. But we know a laptop is not an animal, of course, so the embedding for a laptop would be different in this case. And finally, all embeddings are vectors, but not all vectors are embeddings, because embeddings can capture the meaning behind the words, whereas a plain vector cannot. This is a view of how a machine learning model tries to capture the true meaning of words. On the left-hand side we have our training data. We provide this training data to our machine learning model, the model tries to understand the meaning behind it, and when we provide the test data it tries to map each word to the closest possible words. In this example we can see it has successfully linked the word 'learning' with 'machine learning', 'supervised', 'unsupervised' and 'training'.

We have two major sources of embedding models available: one is OpenAI, and the other, the free one, is Hugging Face. The initial model provided by OpenAI was text-embedding-ada-002, which we have just seen in the simple demo; it generates embeddings with 1,536 dimensions. They have recently added a new embedding model named text-embedding-3-large that can generate up to 3,072 dimensions. But of course these are not free; you actually have to pay to use these models. If you are looking for open-source embedding models, then Hugging Face can be your friend: it is a platform that hosts thousands of freely available embedding models.

Let's now explore the world of contextual search. Here I have pasted the Google search results for two queries: on your left side you can see 'apple weight' and on your right side 'apple revenue'. When we search for 'apple weight', Google is smart enough to understand that if the word 'apple' comes right before the word 'weight', it should display results for the weight of the apple fruit. And similarly, if the word 'apple' comes before 'revenue', it should display the revenue of Apple the company, not the apple fruit, because it doesn't make sense to display the revenue of a fruit. So we can see that Google is using contextual search here, and yeah, that's pretty cool.
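Before moving on to the workflow, here is a sketch of the "copy the array and paste it into your Postgres table" step done programmatically rather than by hand. It assumes the psycopg driver plus the pgvector-python helper package and numpy, a database named demo, and a made-up documents table; none of these names come from the talk.

```python
# store_embedding.py -- sketch: generate an embedding and store it with pgvector
# Assumes: `pip install openai psycopg pgvector numpy`, and the pgvector
# extension available on the server. Connection string and table name are
# illustrative only.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

client = OpenAI()
text = "4.5 billion years"
embedding = client.embeddings.create(
    model="text-embedding-ada-002", input=text
).data[0].embedding

with psycopg.connect("dbname=demo") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)  # teaches psycopg how to send/receive the vector type
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        bigserial PRIMARY KEY,
            content   text,
            embedding vector(1536)  -- matches text-embedding-ada-002's dimension
        )
    """)
    conn.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (text, np.array(embedding)),
    )
```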
OK, let's now try to understand the workflow of pgvector, or really any vector database, with a pretty simple example. Let's assume that I am the developer here and you guys are the end users, and it is my responsibility to create a nice-looking AI application that you can query about anything regarding cities in Vietnam. How am I going to do that? First of all, I will look on the internet and fetch all the publicly available information about the cities in Vietnam, maybe from Wikipedia. Once I have that information, I will divide it into chunks: if we need information about the top 10 most populated cities in Vietnam, I can convert it into 10 chunks, so each chunk contains the information about a single city. Once I have these chunks, I use an embedding model. I pick chunk number 0, pass it to the embedding model, it generates embeddings for me, and I store those embeddings in my vector database, and I do the same with all the other chunks. Once my database is ready, an end user can query my application, for example "how many airports are in Hanoi?". That query is sent to the embedding model, which creates an embedding for it; I take that embedding, go to the database, perform a similarity search, and based on the most similar results I send an answer back to the end user, who then knows how many airports there are in Hanoi.

There are a couple of techniques used to perform similarity searches; typically we use three. Number one is L2 distance, number two is cosine distance, and number three is inner product, which is of course on the next slide. L2 distance measures the straight-line distance between two objects, whereas cosine distance measures the angle between two objects. Let's try to understand this with a very simple example. Let's assume you have two toys, a toy car and a toy spaceship. If we apply L2 distance to these objects and try to find their similarity, L2 distance might say, OK, the shapes are different, so these things are different. Whereas if you apply cosine distance, it might say the shape is different, but both of them fall into the same category, they are both toys, so these are the same kind of thing. Inner product, in turn, takes the projection of one vector onto the other. Let's take another example: say we have two toy cars, one blue and one red. L2 distance might say the shape is the same and both are toys, so these are the same things. Cosine distance might also say they are the same, because of course the shape is the same. But inner product might say the shape is the same and the category is the same, yet the color is different, one is blue and one is red, so these are not the same objects.
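In pgvector these three techniques are exposed as SQL operators: <-> for L2 distance, <=> for cosine distance, and <#> for (negative) inner product. Here is a sketch of the query side of the Vietnam-cities workflow, reusing the hypothetical documents table and setup from the previous sketch:

```python
# similarity_search.py -- sketch: embed a user question and find the closest chunks
# Reuses the illustrative `documents` table created in the previous sketch.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

client = OpenAI()
question = "How many airports are in Hanoi?"
q = np.array(
    client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
)

with psycopg.connect("dbname=demo") as conn:
    register_vector(conn)
    # <=> is cosine distance; <-> would be L2 distance, <#> negative inner product
    rows = conn.execute(
        """
        SELECT content, embedding <=> %s AS cosine_distance
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT 3
        """,
        (q, q),
    ).fetchall()

for content, distance in rows:
    print(f"{distance:.4f}  {content[:80]}")
```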
Now let's talk about some indexing. There are mainly two types of indexes available: one is the IVFFlat index and the second one is the HNSW index. IVFFlat was first introduced in version 0.4 of pgvector, and HNSW was introduced in version 0.5. IVFFlat is a list-based indexing technique and HNSW is graph-based, and both are used to perform approximate nearest neighbor searches, which means they retrieve the closest possible vectors inside the database. Here I have tried to show a visual representation of how an IVFFlat index is built. Initially you have your vectors as scattered points on the X and Y axes. Then we find a number of clusters in that space, and once we have the clusters we calculate a centroid for each one. When we have the centroids, we link each data point to its nearest centroid, and that is how you eventually end up building an IVFFlat index. If we talk about the HNSW index, it is of course a graph-based index. I have shown a total of three layers here, but you can have any number of layers. The bottom layer contains all of your data points, and as you move up the layers you have fewer and fewer of them; the data points are linked not only horizontally within a layer but vertically across layers as well.
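In pgvector both index types are created with ordinary CREATE INDEX statements. A sketch against the same hypothetical documents table follows; the parameter values are arbitrary illustrations, not tuning advice, and in practice you would pick one index type per column:

```python
# create_indexes.py -- sketch: the SQL for the two index types discussed above
# Table/column names and WITH parameters are illustrative only.
import psycopg

with psycopg.connect("dbname=demo") as conn:
    # IVFFlat: list-based; `lists` is the number of clusters the space is split into
    conn.execute("""
        CREATE INDEX IF NOT EXISTS documents_embedding_ivfflat
        ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)
    """)
    # HNSW: graph-based; `m` and `ef_construction` control graph connectivity
    # and how much work is spent building the graph
    conn.execute("""
        CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
        ON documents USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
```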
There are a couple of limitations of pgvector as well. We can only index vectors of up to 2,000 dimensions, so if you have more than 2,000 dimensions you might need to use dimensionality reduction or partial indexing techniques with pgvector. It also doesn't support in-database embeddings, which means you have to rely on some third-party tool to actually generate the embeddings. Where should you use pgvector? You can ask yourself a couple of questions: if you are dealing with a large amount of unstructured data, then you can use pgvector, and if you want similarity search inside your database, then pgvector can be your friend as well. Recommendation systems and anomaly detection systems are good examples of pgvector use cases. There are also Lantern and pgvecto.rs, which can be used as alternatives to pgvector, and the fun fact is that both claim to perform similarity search faster than pgvector. So yeah, that's all about it. If you want more information about what we do, you can scan this QR code, and here is my LinkedIn profile, so make sure we connect on LinkedIn. Thanks a lot.

Please thank the speaker. Any questions?

Yeah, does your team have any plan to leverage GPUs to speed up computation? I get this question about pgvector a lot. Currently there is no GPU support available in pgvector, and I don't have much information about whether GPU support is in the pipeline, but I'll look into it and maybe we can talk after this. Thanks.

Does it work only with digits, or with strings and chars too? With pgvector we can store the numbers in the form of embeddings, and we can also store the original data inside the database, so you can store both of those things; it's possible.

Keep in mind we have some issues in core with storing vectors in arrays and estimating their selectivity; do you improve on that somehow in the extension, for example with some selectivity strategies, maybe a custom heap or a custom access method? In order to perform the searches, the recommended way is to use indexes inside the database, so you can use IVFFlat or HNSW, and HNSW is of course the most recommended one because it is faster than the IVFFlat index and performs similarity searches more efficiently and more accurately. I hope that answers your question.

Any further questions? I just wanted to confirm something, you went past it too fast, sorry: you said there is no in-database embedding, what did you mean by that? To generate the embeddings, as we have seen on my terminal, they have to be generated by some third-party tool, either OpenAI or something from Hugging Face.

Any further questions? Well, then, let's thank the speaker. Thank you. So that concludes today's session. A few things, don't forget: for the speakers here, we have an event tonight; you are on the tickets page, so you should be aware of that. Tomorrow we have a continuation here in this room, and I think that's all. Thanks everyone for the first day, I hope you enjoyed the first day at FOSS Asia, and we'll see you tomorrow.