All right, hello, hello — okay, great. It's about two minutes to time, I saw the clock. I'll probably just need a hand mic for Q&A, if any happens. Thank you so much, really appreciate it. No, I'm good, I have my clicker. Actually, let me try this one — hold on, because I tested mine and I did not test this one. No, this one does not like me; we're not doing this one.

All right, awesome. So, can everybody hear me? I assume yes. Thank you. Since there is no moderator today, I am just starting, because I always start on time. Anybody who has been to my Chicago PostgreSQL User Group knows I start on time even when the speaker does not show up — I make a quick substitution, because it keeps people organized; other people need to finish and get on the train, or whatever.

So anyway: good morning, everybody. Thank you so much for coming to the first talk of the day, and especially for coming into this room, which is a freezer. I know some of you made a sacrifice choosing between this room and the other room, so I really appreciate you choosing my talk. Today we will be talking about securing your Postgres data. I'm delighted to speak at SCaLE for the first time in my life, so it's all new for me — it's a huge conference, and there are people here whom I usually do not meet at conferences. So thank you, everybody who is at my talk for the first time.

Do not bother about the last name — there is only one Hettie in Postgres. Google "Hettie Postgres" and you will learn everything about me, probably more than I want you to know. I'm Hettie on LinkedIn, Hettie on GitHub, Hettie everywhere. I've been doing lots of things with databases in general and Postgres specifically, and at the moment I'm a database architect at DRW Holdings in Chicago. They say they're going to change my title because nobody knows what an architect does, and rumor is that data architects do not do anything — which is not the case in my case — but at the moment I am a database architect.

A more important title, which I have held longer than any of my job titles: I am the local organizer of the Chicago PostgreSQL User Group. I've been leading this group since December 2016, and at some point it was the second-largest user group in the Western Hemisphere — I haven't checked recently, I think there are larger groups now — but I am happy that I led this group through the pandemic. We switched to Zoom in April 2020, so I missed just one meetup, and since then we meet both virtually and in person. During the pandemic I promised that I would always livestream my meetups when they returned to being in person, and I still do. So if you are interested, we meet once a month: go to Meetup and find the Chicago PostgreSQL User Group, and there will always be a link to the Zoom version of the meetup, so even if you are not in Chicago you can join me virtually.

Another very important thing I do: I'm on the operations committee of PGDay Chicago. That is, by the way, only the second event of its kind in the United States — I was shocked when I learned it, because there are so many PGDays in Europe. So there are only two legit community Postgres conferences of that kind in the United States: PGConf NYC and PGDay Chicago.
It is happening now for the second time, and so far it's a one-day conference. Please support it, because if I meet the goal for this year, then hopefully next year I'll be given a two-day conference. I still need sponsors, I still need participants — registration is open, there's still plenty of time. If you can do even a small sponsorship, please sign up; if you want to attend, please sign up. The program is awesome — check our website, it's a great program, three tracks, and I moved it back to the Chicago Loop, so you will be looking at the best of Chicago from the windows, if you are not looking at what people are presenting. So please come and join me.

Now — why this talk? For those who know me, I'm all about performance. The performance queen, or whatever you call this Russian lady who comes in and tells people how things should work. Normally I do talks about performance and about my projects — I have several open source projects which I love. So this is my first security talk. Why am I all of a sudden talking about security, when I never did before?

Before the official slide, here is the real story. At some point during my employment with DRW Holdings I had to automate security settings, which is what this whole talk is about. And when I started to convert databases from disaster to proper security, I found so many issues with how Postgres does security that I was shocked. I wrote a blog post, and it was not a WordPress-style discussion, it was a Facebook-style discussion: very angry people writing long responses to me. So that was a disaster, and then I thought, okay, I need to let the world know, and I started submitting this talk — and I got rejected, rejected, rejected.

So I talked to my friend Lætitia Avrot: I got rejected for the fifth time, what's wrong, how can I let the world know what I think, because otherwise nobody will listen to me? And she said: first of all, do not call the talk "What is wrong with Postgres", because everybody will get upset. But it is something that is wrong with Postgres — so how do you want me to do this? And she said: you know what, use ChatGPT; that's what I do with all my talk descriptions, and you will get accepted. I said no, ChatGPT is not me, everybody will see it's not me. But I tried, and ChatGPT replied with something that was, oh my gosh, so not me. I fought with it for a couple of days, then decided: okay, I will take the ChatGPT version and make it look like me — and this one got accepted. So the abstract is ChatGPT plus Hettie. Finally, I have a chance to talk about what I have wanted to talk about for a whole year.

Okay, so now the official version of why I'm doing this talk. We live in the age of data breaches, and I do not even need to enumerate any recent ones, because everybody knows — everybody has heard of them, and many of you have been victims of a data breach. So securing data is a high priority anywhere.
We even have a whole track about security here, by the way. They did not accept my talk — the Postgres track accepted my talk; the security track at SCaLE said it's not important. You know why they said it's not important? Because they believe the only thing you need to do is secure your network — secure it from intruders, from outside, from inside — and the moment you get into your super-secured network you can do anything, because nobody protects the data in the database itself.

And by the way, Postgres has everything. Look at any other database system: Postgres has the most sophisticated, most flexible system for granting permissions, at very fine granularity, exactly what you need. Postgres has it all. Still — you know what happens in real life, not in theory, not when the contributors tell us about row-level security. Who knows what happens in real life? It's interactive! Who knows what happens in real life?

That's what happens in real life: production applications. My internal customers come to me: we need to install this application, we just paid tons of money for it. I say I cannot install it like this — but that's what it requires: it requires superuser, it requires the public schema.

And there is a reason why this happens. No matter how hard we try, this happens, and I will tell you why. Just to set the agenda: we will go through the most important security challenges — which, as I said yesterday, I want to become part of Postgres, not Hettie fixing Postgres. I will explain the necessity of a standardized approach, I will show the security models we adopted at DRW Holdings and how we use automation, and then what is next — what I'm trying to achieve, besides making half of this part of Postgres.

So, the reason it's still the public schema and the postgres user: there are several challenges. The first challenge is that you do not need any security to start working with Postgres. Postgres has a very low entrance barrier — we know that, because nobody needs to learn Postgres: you get it, you install it (it's open source), and then maybe you start reading the documentation. But the moment you installed it locally, you are automatically connected as postgres, you are automatically in the public schema — go.

And if you do decide to look at the documentation, here is the table of contents. Do you see anything about roles, users, permissions? Tell me, do you know where it is? I actually know where it is, but nothing jumps out, right? Remember — most of us have been with Oracle at some point of our career; let's be honest and admit it. Remember how, when you start with Oracle, you do this scott/tiger thing — you cannot be nobody. In Postgres you can be nobody, or you can be everybody: you can be superuser.

In the documentation, the first word about privileges appears somewhere in the middle. So you have already gone through "what is Postgres", "how to use it", the quick start, everything — and all of a sudden, oh, now we can talk about users. And if you click on that, do you see where privileges are?
Not at the beginning — way down here. So by that point you have already stopped reading the documentation, because you did the quick start and you know how to do everything.

And the most alarming thing — you know how happy I was when Magnus just announced that the special status of the public schema was removed in Postgres 15 — still, in 15, in 16, all the examples in the documentation use the public schema. Look for it: nobody ever mentions that you can create something other than the public schema; everything goes there by default. Then we have a hard time telling users that this is wrong. I have literally heard users say: you just told us that schemas exist, we did not know before. I'm not even kidding. So that was challenge number one.

As a result, what happens? Applications are developed using the postgres user, because it's easy. When they go to production, there's no time to fix it, because it's the deadline, it's deployment, we will lose money if we don't deploy today — no matter how long the development took. And the other thing: whoever designs and contributes to Postgres seems to forget that in real life there are no users sitting there typing; there are applications using connection pools. So forget about row-level security or whatever — everybody is connected using the same user, and how often that user is postgres, more than you want to know. It's the same database user.

Challenge number two: the wonders of inheritance. We did this a long time ago — yesterday there was actually a conversation about when it started; I was already around back then, I remember when we moved to it. The difference between groups and users was eliminated, because now everything is a role: a user is just a role with login, and we can grant anything to anything and anybody to anybody.

So if we create role_one, that is a role with no login. We create role_two with login — that is a user. We create user_one, which is the same thing as a role with login and a password — again, the same kind of role. And now all of these grants will work; Postgres does not see anything wrong with any of them. You grant role_one to a user; you grant one user to another user — and by the way, forget that you secured that user with a password: you granted the user, so the other user does not need to know the password. And at the end — the wonders of inheritance — you create yet another role and you grant it to role_one. When you grant it, you do not know who else has role_one, and everybody who has role_one will also receive role_three. And when you grant permissions to role_three, everything propagates up — and you do not even know what will happen.
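Spelled out, a chain like that looks roughly like this (the role names match the slide; some_table is just a placeholder):

    CREATE ROLE role_one;                      -- no LOGIN: effectively a "group"
    CREATE ROLE role_two LOGIN;                -- a role that can log in: a "user"
    CREATE USER user_one PASSWORD 'secret';    -- CREATE USER is just CREATE ROLE ... LOGIN

    GRANT role_one TO user_one;                -- user_one inherits whatever role_one has
    GRANT user_one TO role_two;                -- one "user" granted to another: role_two now
                                               -- acts as user_one without knowing the password

    CREATE ROLE role_three;
    GRANT role_three TO role_one;              -- everyone who already has role_one
                                               -- (user_one, and role_two through user_one)
                                               -- silently picks up role_three as well
    GRANT DELETE ON some_table TO role_three;  -- and this grant propagates all the way up

Two grant statements on top of an existing role tree, and it is already hard to say who can do what.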
So you see, it creates a total security hazard — two steps is enough to create a total security hazard in your database. One horrible thing I had in real life (not recently): public was accidentally granted superuser, by way of multiple inheritance.

Another thing that people who work with Postgres professionally know, but everybody else forgets: people think that a user is created for a database, because you are connected to a database when you create the user. It is not created for that database — users are created at the instance level. I cannot tell you how many times people say "we want this in a separate database, for security reasons." There is no security there: any user you create automatically has the CONNECT permission on a new database. Yes, you can explicitly remove it, but when you create another database you need to know what to remove, and from which users. You create a user — by default they have access, they have CONNECT; and if you keep the public schema, anybody from any database can create anything in the public schema of any other database. Also, sometimes we need superuser for whatever reason, and if you made that mistake, that superuser can do anything in all databases on the instance. So hello — no security here.

Now, we actually know how to do things the right way — and until I started at DRW Holdings I thought I knew how to do things the right way, because people do know this: we use schemas as units of security, we have an owner of the schema, we have roles, we do not grant privileges to users directly — we create a role that holds the privileges. What I'm going to demonstrate now is written in many, many blog posts (not in the Postgres documentation, because we are flexible), but in any blog post by anybody who tries to do it the right way you will find something like this: using a schema and granting roles to users.

So you will have something like this: you create a schema owned by an owner role, you grant this role to a user, you create a read-write role, you create a read-only role, and you operate like that. That's not new, that's not Hettie's invention; it has been around for a very long time. Then you grant SELECT on all tables to the orders read-only role, and you grant SELECT, INSERT, UPDATE, DELETE on all tables to the orders read-write role.

So — what won't work? It's interactive: what won't work? Truncate — yes, thank you, but not only that. What else will not work? Any application developers here? Yes, thank you: the full privilege challenge. And by the way, one thing everybody forgets here is GRANT USAGE on the schema — I still need to understand why it has to be GRANT USAGE. So when you grant "read-write" like this, things do not work, and then you need to define default privileges, because the previous grants — the ones we had here — only granted permissions on tables which already existed in the schema. If we just created the schema, there is nothing there yet. So we need default privileges — I will get to those in a second.
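First, though, the base pattern from those blog posts, spelled out (app_admin is a placeholder login role):

    CREATE ROLE orders_owner NOLOGIN;
    CREATE SCHEMA orders AUTHORIZATION orders_owner;
    GRANT orders_owner TO app_admin;                 -- the admin acts through the owner role

    CREATE ROLE orders_readonly NOLOGIN;
    CREATE ROLE orders_readwrite NOLOGIN;

    GRANT USAGE ON SCHEMA orders TO orders_readonly, orders_readwrite;   -- the grant everybody forgets
    GRANT SELECT ON ALL TABLES IN SCHEMA orders TO orders_readonly;
    GRANT SELECT, INSERT, UPDATE, DELETE
        ON ALL TABLES IN SCHEMA orders TO orders_readwrite;
    -- "ALL TABLES" means tables that exist right now; future tables need default privileges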
Now, default privileges: you ALTER DEFAULT PRIVILEGES IN SCHEMA orders, GRANT SELECT ON TABLES to the read-only role, and grant everything else to the read-write role. And this should be fine, because now for any table we create — and views, by the way; views count as tables in this situation — the read-only user will have SELECT and the read-write user will have everything.

And remember, we created the roles, we created the user — all good, right? So there is a user who can create tables; we create a table. By default, are the permissions applied? It's interactive: by default, are they applied? You created the table and they are not applied. Why? Remember, I had an admin user granted the owner role; as that admin, who can create, I create a table — and the default privileges are not applied. You know why? What matters is that you need to define default privileges not for whoever, but for the specific role, because if you just issue ALTER DEFAULT PRIVILEGES, whose defaults does it alter? Those of the user who ran the statement. So if you are the database owner when you run it, it applies to postgres or whoever, not to the intended role. That's what you need to do: ALTER DEFAULT PRIVILEGES FOR ROLE, for the schema owner, because you want everything encapsulated by that owner.

This actually leads to another situation: as the admin user, you need to explicitly switch to the owner role, because otherwise you can create the table, but the default privileges won't be applied either. So there are two things that can go wrong.

All right — the wonders of ownership. This is the thing that blows the mind of everybody who comes from Oracle. We own the schema: CREATE SCHEMA orders AUTHORIZATION orders_owner. The ownership grant comes by default — you do not run GRANT ALL ON SCHEMA orders TO orders_owner; it is implicitly there when you create the schema. This orders_owner role can create all objects, can grant, and has everything: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, whatever.

Now — and I face this all the time, when people say "oh my gosh, we actually need to tighten security" — if I run ALTER SCHEMA orders OWNER TO new_orders_owner, who knows what happens to the permissions? Anybody tried it? The answer is: nothing happens to the permissions, literally nothing. The schema changes its owner, but the old owner still has the full set of privileges. So in order to really switch the owner, you need to revoke everything from the previous owner: read the catalog, go through all the objects, switch the owner on each of them. And yes, there is the REASSIGN OWNED command — all it does is ownership; it does nothing with privileges. You also need to explicitly rework all the default privileges and grant the new ones. It's a huge project. One of my asks is: if we want to be flexible, why won't we at least allow some flag, like "switch ownership cascade" or something? Because with lots of objects in a schema, this is a huge project.
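Putting those two gotchas together — default privileges tied to the owner role, and the ownership-change caveat — a rough sketch (new_orders_owner and the table are placeholders):

    -- Default privileges have to be attached to the role that will own new objects;
    -- without FOR ROLE they only apply to objects created by whoever ran the statement.
    ALTER DEFAULT PRIVILEGES FOR ROLE orders_owner IN SCHEMA orders
        GRANT SELECT ON TABLES TO orders_readonly;
    ALTER DEFAULT PRIVILEGES FOR ROLE orders_owner IN SCHEMA orders
        GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO orders_readwrite;

    -- And the admin has to actually become the owner before creating objects:
    SET ROLE orders_owner;
    CREATE TABLE orders.order_items (id bigint);   -- now the default privileges apply
    RESET ROLE;

    -- The ownership caveat: this changes the owner, but the old owner's
    -- privileges stay behind and have to be revoked by hand.
    ALTER SCHEMA orders OWNER TO new_orders_owner;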
And there are lots of other weird things — I could talk a whole hour about them — but just a couple more. For example: you granted SELECT on a table to one role. Then you granted INSERT, UPDATE, DELETE on the same table to another role. You create a user and grant that second role to user_one. So now user_one can insert, update and delete. Then it turns out we actually do not want this user to delete. What happens if you revoke DELETE from the user — who knows? Yes, correct: it does not do anything. It does not do anything because we never granted DELETE to the user — we granted a role which has DELETE; DELETE as a separate privilege was never granted. But do you think Postgres tells you you're doing something wrong? It does not. There is no error: you execute the REVOKE, it says revoked, all good — and the user can still delete. That's one thing I reported a long time ago, and that's what I was referring to: yeah, that's probably wrong, but we have 120 other things we need to address first. So if you revoke a permission that was never granted, Postgres will not tell you this is wrong — it's "revoked", all good. And if you grant the same permission multiple times, the only time it reports that it already exists is when you grant an existing role membership one more time.

Then, in addition, other challenges. You cannot drop a user that has any privileges — it's an error, and you cannot drop the user. And you cannot DROP USER ... CASCADE: with most other things you can drop with CASCADE, but not a user. You need to revoke all privileges explicitly — and remember, they can be in a different database. And, most interesting, there is no easy way to tell which permissions a user has. Trust me: I wrote a function which lists all the privileges for a user, and it's about a hundred lines long; other people who tried to do the same also ended up with about a hundred lines of code, just to list all privileges for one user.

This is all fine when you have a few databases. I have approximately 280 — we are not sure exactly how many we have in the organization, but something along those lines. And we have a very small team, because Postgres is not our only, and not even our most used, database; our total DBA team is 14 people, for all databases — again, Postgres is just one, and not the biggest. Also, I work in trading, and in trading people are super secretive: they want to be isolated, and as we already know, separate databases would not help. Separate instances? I had an internal customer who requested a new instance literally for each new pair of tables, because they wanted to be separated. That gets expensive, even for trading.

So what are the other approaches? That is what led me to developing this security framework. Why am I talking about 280 instances? Because there is no room for error. Previously I could say, "hey, you did not switch to the correct role, I'm sorry, you'll have to start over." I cannot say that to the owners of 280 databases. I need automation that does not let anything go right and left.

So here is my security model. It's not really open source, so I will only show pieces of code; I will try to show ideas rather than exactly how it is done. If you want more clarification, please stop me — but what I'm showing is more "what we did" than the exact code.

What are the basic principles? First of all, the principle of least privilege, which is the principle behind any security setup.
A user is given the minimal level of access needed to perform their actions: if you do not need something, you do not have it — only exactly what you need, not more. Second is durability — and I promised to take whoever breaks it out to a fancy restaurant; so far nobody has broken it. I challenged my DBA team to try to break it without being a superuser, and it's pretty much unbreakable. And flexibility, because different customers have different needs in securing their data. I struggled with different package versions, but the version I have now is flexible: using just two flags I can satisfy pretty much everybody's security and permission needs.

Now, the key features — how it works. First of all (and again, that's my big ask of Postgres: I do not want to have to create event triggers for this; you should be able to just set this up in Postgres), I have an event trigger which fixes permissions when an object is created. Pretty much, it forces the permissions to be correct in the correct schema. Then I have a security level matrix, which I will go through in a second. Security is set at the database level: on each database you set the security package and you can set its security level. And how is it all executed? I have a set of SECURITY DEFINER functions which do everything instead of the user issuing the command directly. So I'm not giving anybody superuser, but users can create schemas, grant permissions, create users — all through the SECURITY DEFINER functions — and they can only execute the things we allow them to execute.

Enabling it: there is a security package — a set of functions — installed by default in each Postgres database, and it stays dormant until we explicitly initialize it by setting up the security matrix. You cannot change the settings after they are set once for a database; otherwise — yeah, it would be bad. There is one function, executed as the postgres user, which grants the ability to create schemas and users, and it uses the permission settings I will show on the next slide. The permission settings are recorded for that particular database, and if you redeploy the package — because my biggest challenge is making changes to existing functions without breaking things and starting over — if security was already set up, the installation script picks it up and re-implements it, so I can redeploy multiple times.

So, that is the matrix, and I will go through all of it. We have two flags: schema owner and account owner. How did it come about? First there was the schema model I described previously: the database owner owns all the schemas and grants read-only or read-write; it covers most situations. Then one of my internal customers said: no, we actually want to be isolated per schema, because different teams are creating objects and we want them to know nothing about each other — and if you can do that for us without creating yet another instance, it would be great. So there came model number two, with schema owner set to true, where each schema is owned by a separate user: schemas are still created by the admin, but each schema is owned by its own user and gives its own privileges to, for example, the application user. Then another customer said: actually, I want something different.
I want power users, each of which can create their own schemas but knows nothing about the others. So then there was the model with the account owner, where the account owner holds the schema ownership. And then there was one more: a customer said, we are selling services, so we want to create accounts, and those accounts can create their own schemas — again, they can do everything within their own schemas but know nothing about each other. I said okay, and that was the finalization of this matrix. At the moment — and if somebody knows of another case, please tell me — this matrix covers everything we have needed.

So let's look at what is done for each of them. In the first case, false/false, each schema is created by the database owner, and for each schema two roles are created, read-only and read-write, and the privileges are granted to those roles. That's the classic setup everybody who does security at the schema level uses. In the next model, the database admin still creates users and schemas, but for each schema it creates a schema owner role, and the same story: a read-only role and a read-write role — only here the owner is that schema owner. In the third situation, we create an account, and the account has the privilege to create schemas — again via the SECURITY DEFINER functions — and it also creates the read-only and read-write users. And in the last one, each schema is created by the account owner, but for each schema a separate schema owner is created, and all the default permissions cascade in this manner: the account owner can do everything with their schemas, and the schema owners can do everything within their own schemas.

Okay, so what do the functions do? Honestly, the biggest hurdle for people adopting it — once they realize the advantages it gives them, they love it — is that they need to call functions. And again, I would love for this not to be functions; I would love it to be some flag or option on CREATE SCHEMA, CREATE TABLE, whatever, and I'd be happy. But anyway.

create_schema_roles is the first function. Whoever has permission to create can call it; this function will create the schema, it can create the application user (and the password at the same time — it doesn't have to), it can create the schema owner if the security level says schema owner, and it can create that password. It can also designate the schema to a specific account, if accounts are applicable; that defaults to the current user, and you cannot really change it — I'll tell you why in a second. The call is always the same; everything inside is driven by the existing security matrix — which roles are created and which are not.

drop_schema_roles: remember, you cannot drop anything without dropping all the privileges, and the best part is that if you drop a schema — which you can — you are left with all these remaining users, and you cannot recreate the schema because you cannot recreate those users. So instead of DROP SCHEMA you call drop_schema_roles and pass the schema name as a parameter. What it does: it revokes all the associated roles from the users, it drops those roles, and it drops the schema — all of these things, a clean drop.
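The names create_schema_roles and drop_schema_roles are the ones from the slides, but what follows is only a hypothetical sketch of the shape such a wrapper can take, not the actual package (it assumes the function owner has CREATEROLE and CREATE on the database, and it skips the matrix and caller checks):

    CREATE FUNCTION create_schema_roles(p_schema text,
                                        p_app_user text DEFAULT NULL,
                                        p_password text DEFAULT NULL)
    RETURNS void
    LANGUAGE plpgsql
    SECURITY DEFINER                  -- runs with the privileges of its owner, not the caller
    SET search_path = pg_catalog
    AS $$
    BEGIN
        -- the real version consults the security matrix recorded for this database
        -- and verifies the caller before doing any of this; p_app_user / p_password
        -- would create the application login here
        EXECUTE format('CREATE SCHEMA %I', p_schema);
        EXECUTE format('CREATE ROLE %I NOLOGIN', p_schema || '_owner');
        EXECUTE format('CREATE ROLE %I NOLOGIN', p_schema || '_readonly');
        EXECUTE format('CREATE ROLE %I NOLOGIN', p_schema || '_readwrite');
        EXECUTE format('ALTER SCHEMA %I OWNER TO %I', p_schema, p_schema || '_owner');
    END;
    $$;

A caller would then run something like SELECT create_schema_roles('orders'); drop_schema_roles is its mirror image — revoke the associated roles, drop them, drop the schema.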
Then the assignment functions. You can assign the schema owner user: you pass the schema name, the user name, and the password. Here is what it does: if it's a new user, it creates the user with that password; if it's an existing user and you do not pass a password, it just assigns the role; if it is an existing user and you do pass a password, it changes the password. That's how it works. Same for the application user — same story, but for the application role — and obviously there is one function underneath it all, but our users do not need to know that. For the read-only user, again, same story; underneath it is the same function. And there are all the revoke functions as well: you can grant a role and you can revoke a role, and everything is done in a clean way. We do not drop users, because remember, we do not know what else those users do — there are creative people who reuse a user — so we do it safely.

All right, and the additional SECURITY DEFINER functions — these are very, very much needed. First of all, select_all_privileges: if anybody in a previous life ever saw my diff project, where I compute differences between instances — this function is copied from the diff, and it lists all privileges, disassembled down to the atomic level, no matter how the privileges were granted, by roles or directly. It lists everything, so it's a good self-check. The other two are for seeing blocking processes and pg_stat_activity. Why do we need those? Because, again, we are not giving people superuser, and without it you can only see your own SQL: if you run the query and the session is not yours, no SQL text is displayed, because that's security. But an admin wants to know who did this. So this function runs as SECURITY DEFINER, as postgres, and it can display the full text of the queries. That's very important in our situation, because in the queries you can see all the data — and, working in trading, I see data I severely want to unsee, but I have to. Nobody except the admin is entitled to see those things.

All right, code details. It's no magic. How am I forcing privileges? That is the part with the event trigger. When you create any object, it's not one CREATE — it's a whole bunch of creates: when you create a table, it's create table, create constraint, create index if you have a primary key or a unique key, and a whole bunch of other things. The event trigger on create goes through all the objects which were created and changes ownership and permissions where it makes sense — there are lots of little details, like you do not change ownership on an index, but the index needs to be created after the ownership is already changed. Everything is handled inside this trigger. So no matter who you are, no matter how many times you forgot to switch to the correct role: if you have the role, if you have the right to create objects in the schema, the ownership and permissions will be assigned correctly by the event trigger.
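A heavily simplified sketch of that event-trigger idea — the <schema>_owner naming convention and the missing matrix checks are my shortcuts, not the real package:

    CREATE FUNCTION enforce_schema_ownership()
    RETURNS event_trigger
    LANGUAGE plpgsql
    SECURITY DEFINER        -- owned by a role that is allowed to change ownership
    AS $$
    DECLARE
        obj record;
    BEGIN
        FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
        LOOP
            -- indexes, constraints etc. follow their parent, so only touch these:
            IF obj.object_type IN ('table', 'view', 'sequence', 'function') THEN
                EXECUTE format('ALTER %s %s OWNER TO %I',
                               upper(obj.object_type),
                               obj.object_identity,           -- already schema-qualified
                               obj.schema_name || '_owner');
            END IF;
        END LOOP;
    END;
    $$;

    -- fire only on CREATE commands, so the ALTER above does not re-trigger it
    CREATE EVENT TRIGGER fix_ownership_on_create
        ON ddl_command_end
        WHEN TAG IN ('CREATE TABLE', 'CREATE VIEW', 'CREATE SEQUENCE', 'CREATE FUNCTION')
        EXECUTE FUNCTION enforce_schema_ownership();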
Now, how do I know that you have the right to do this? That is again part of the code: it's a standard recursion which identifies which roles you have — plus, if you were smart, you explicitly switched to the right role (which you don't have to, but you should). This basically returns true if you are allowed to do things in this schema.

The other thing: okay, you are allowed — but how do I know you are allowed? We have the security matrix, we have the account owners, so I need to know who you are. But when you are inside a SECURITY DEFINER function, you are already postgres. So how do I know who you were before you came in? I have a non-SECURITY-DEFINER function on top, which passes me the current user. And then, how do I make sure nobody calls the inner function directly? For that I check the stack: the SECURITY DEFINER function can only be called from the non-SECURITY-DEFINER function, not standalone — that way I know I captured who you are correctly. And then I know whether you are allowed not just to execute the SECURITY DEFINER function, but to execute it on this particular object. So you cannot drop somebody else's schema, you cannot create objects in somebody else's schema, you cannot assign a new user to somebody else's schema. You can only do it as you — and, by the way, not even as postgres, which is a bit of a disappointment for our DBAs.

All right. That is, again, very briefly how it works, and I know it was not a lot of code — the package is pretty gigantic. What is next beyond this? First of all, users keep asking me about reporting, and although I have this list-of-privileges function, it's on my backlog to create reporting which tells you: this database has these accounts, these schemas, and these are the people who have each role. A nice JSON output is due; it will happen.

Then testing. First of all, people do not understand that you need unit tests for this — it's a whole big thing, and nobody wants to do it. Not that I love writing unit tests, but with this complexity I actually had to. I am about 80% done, and it turns out every time that I did not cover some of the cases: I have about a hundred tests for each of the versions of this matrix, and it's still not enough — I'm still finding cases. Hopefully it will be done soon.

And conversion automation. The big thing is that we have customers who are not on the security framework, and now we are making it a company-wide standard, so we want to automate the conversion from God-knows-what — you literally cannot imagine what can happen — to the proper security framework. So far it's a severely manual process, because people are so creative: schemas owned by the login user, so you need to do several things; objects owned by different users within one schema. The fun part is that I tried to automate it with functions, and when you have a partitioned table with several thousand partitions — and we have those, and honest to God, in many cases it's legit — I run out of DDL locks in the transaction. So before I convert, I need to investigate how many schemas have partitioned tables, and then I need to change ownership — because yes, each partition's ownership is separate. So for now it is a very loosely automated process. That's my backlog.

Most importantly, I would love to have a lot of these things as part of Postgres. To summarize:
I do not think Postgres should silently revoke things that were never granted — there should be some hint. I would love to have an option: I created the schema, please make sure all objects in the schema are owned by the schema owner. I would love — well, at a minimum — I've talked to many people, and they already told me that preventing privileges from being granted directly to users can't work since we did what we did with roles and users; and I know you can create a NOLOGIN role and then just add a password and it becomes a login role, great. So no salvation there. But this is how I'm doing it now, and it helps. Each time one of our internal customers goes "oh my gosh, what's happening, we don't know — permissions lost here, permissions granted there," we say: okay, switch to this framework. It gives some sanity to the DBAs on support, because at least we know what to expect — and even a DBA on support who does not know this particular database. I've had so many calls from DBAs on support: why is this happening? Because you are not allowed to do things directly.

So that's it. Thank you for listening; hopefully it was interesting. Questions?

Q: Hello. As far as visibility into the issues you presented — have you ever considered using a threat-modeling tool or other command-line tools to identify permission issues or other issues with databases, like the Microsoft Threat Modeling Tool?

A: No. So, permission issues happen — everything I demonstrated I came across when I was trying to convert people and switch permissions; I found so many things I don't even want to know how many. Then the question is how you find them. That's the big problem: there is no SQL command that lets you list all permissions. There are some queries, like the recursion that lists all roles, but that's pretty much it; you cannot really list all atomic permissions. So I have this open source package — again, it's one search away, "Hettie Postgres," dig it up — the diff package, and the function is there. You can deploy it on your database at any time (it's not an extension), and it will list all permissions. If there is something else out there, I'm listening; everything I found was inconsistent.

What's that — an inconsistent slide? The slides will be posted; if they aren't there within the next hour, you can find me and beat me up. I was making changes until this morning, so I may have forgotten to display something.

Q: Does either you or the Chicago group you work with have any fabulous GitHub repos where we can borrow some of your code?

A: Do I or the Chicago group have a GitHub repo where you can borrow code samples? This framework is not open source, because I developed it for the company and I still need to figure out what I can release. But all my open source projects are in my repo — I have five open source projects, again under Hettie on GitHub, all visible. I will try to open source this; at the moment it's proprietary code, but I'm planning to open source as much as I can. At the moment it is not.

Q: For the security piece, did you reference anything like NIST or
CIS, or any security-related templates, to secure your module?

A: Wait — no. No. It's all, pretty much — the first security schema is the one everybody uses, and I started with it; I used it at my three previous jobs and startups. And then all of a sudden I learned that people need way more, so I just started to come up with models, and then I generalized them. So far, everything I have across these 280 databases fits one of the four.

Q: Do you have any strong feelings about row-level security?

A: Sorry, about what? ... Yeah — how do I feel about row-level security, RLS, which restricts rows? Okay, I can tell you how I feel, because the people whom I want to hear this are not here, but I'll say it anyway. This is one of those things where the people who came up with it do not know how people use Postgres in the real world, because it's just not needed — not needed, 100% — because there are no users sitting there typing commands; there are applications, and for most applications, if we have one user per application that's already a huge win; very often it's one user for all applications. So forget about it. I think it's just one of those things where — I'm trying to be the voice of reality here; I'm not a contributor, I'm nobody.

Q: So do you think it's a bad idea to have many different users accessing the same database?

A: No, no — it's a great idea to separate by users, but the world does not work that way. I'm in the real world; I'm a realist. No matter how much I hate ORMs, ORMs are here to stay; no matter how much I hate many of the things that are out there, they are here to stay, and my goal is to assist application developers to do what they need without creating a hazard in the database. That's all I'm trying to do. Row-level security is something I've always wondered about — what was the idea behind it? — because it's just not applicable to real life. But okay, I'm happy if somebody does use it. Awesome. Any other questions? All right — thank you. Thank you so much.

Test, test, test — hello, test. There we go. Cool. I feel like I'm at the dentist; that is definitely not comfortable. Test, test. Hello. Oh, it is on — or turn it off? Turn up the volume? I don't know if I can do that from here. Or I could just use the handheld mic — it'll be easier to control. Test, test. Cool, much better.

All right, good morning. How's everyone? Good? Cold? Yeah — I apologize, unfortunately I don't control the temperature or the weather, so there's not much I can do. Oh, Devrim can control the weather? So thank you, Devrim. Hopefully it will get a little warmer — and this is a very hot topic, so that will also help.

So, I'm Jonathan Katz. I work at AWS, and I'm also on the core team of the Postgres project. For better or worse, I have been staring at this problem of vectors for well over a year now, and I'm very excited to talk about this topic today, which I'm terming "vectors are the new JSON." Now, maybe before we start — it sounds kind of weird. What do you mean, vectors are the new JSON?
That sounds like you're just trying to do something clickbaity, to get attention. But please, entertain me for a bit. So here's a JSON document. It has some content in it: it's about Postgres, it contains something about supported versions. At one point, we might have had an application with this JSON document and said: okay, cool, I love using Postgres, I'm going to take that JSON document and turn it into a relational structure. Side note: I think pretty much all JSON can be turned into a relational structure and stored in a database. Cool.

But then we got this notion: hey, that seems like a lot of work. I already have structured data — it might not have a schema as we know it from a database standpoint, but it has structure. Why can't I just take it and store it directly in a database? So people did that; we saw a lot of these JSON databases, or document databases, come up in the late 2000s. And likewise Postgres ended up supporting it, in part because — as I say right here — JSON is a data type. It's a complex data type, but it's just a data type, and one thing Postgres does very well is store a data type: it has a representation in its storage that allows you to query it, manipulate it, and get all sorts of insights from it. And that goes beyond just Postgres: when you have a data type and a database or data storage system, you can store it, you can query it, you can do all sorts of things with it.

Postgres has a long, rich history with JSON, starting back in 2012. As JSON really became popular in the late 2000s with web apps, we saw a huge demand from folks who just wanted to store their JSON data in Postgres, because it either lived next to their other application data or it was their application data, and it was much simpler for what they were doing to be able to query it directly. What was cool is that Postgres was the first relational database able to store JSON, and lo and behold, it became part of the SQL standard, which is pretty neat.

Now fast forward a bit: that story started back in 2012; let's go to 2024. This notion of a kind of weird line has become very popular. This is a vector. To the best of my abilities, I drew a two-dimensional vector with coordinates, or attributes, of 0.5 and 0.5 — bear with me. A vector has two key properties: it has magnitude, or size — how big it is — and it has an angle. So basically you have some kind of size and you're pointing somewhere in space. What's kind of cool is that you can add more attributes, or more dimensions, to it, and you can see that, okay, here I am in three dimensions — hopefully you can see the three dimensions; my parents were grateful that I didn't go to art school — and you can see where I'm pointing, where I am in space.
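For that two-dimensional example of (0.5, 0.5), the two properties work out to roughly:

    magnitude = sqrt(0.5^2 + 0.5^2) = sqrt(0.5) ≈ 0.707
    angle     = arctan(0.5 / 0.5)   = 45 degrees

(the ordinary Euclidean norm and angle; with more dimensions the same norm formula just gets more terms).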
Vectors are kind of old — JSON is the newer data type; vectors have some history to them. They have, let's call it, 120 to 130 years of history, and there's a book from 1901 that codified the early mathematics of vectors. Back then you saw vectors in more mathematical applications around physics — you study vectors when you're learning about forces in physics — or, translating back to introductory computer science classes, you learn that a vector is a specialized form of an array.

But vectors have become very popular as of late, which again is interesting, because — wait a second — we only learn about vectors indirectly in an introductory computer science class, and yet they're a big deal now in these storage systems. And it really is because of generative AI. In some ways, taking a step back even further, I think the reason it has become so popular is that it's much simpler to use AI and ML systems today.

True story: back when I was in college I thought I was going to go into machine learning. I was fascinated by the fact that you could take these systems, learn on them, and create inferences — artificial intelligence really fascinated me. But at the time, looking at all the research going into machine learning, I thought: everything's ad hoc; whenever you want to do something you have to build your own model, it takes time to train, it's exhausting. I'm impatient, I'm going to go into the startup world, I'm going to build web apps, yada yada yada. And of course I kicked myself, because I didn't have the foresight to see that this would be commoditized — that using these models would be as simple as making a REST call. That's really the transformation, in many ways: yes, these models are much smarter, and they make inferences that are much deeper today, but they're also so accessible. We can use them very easily — you could probably go right now and have one of these models build out the talk I'm giving.

But the other key thing about the big foundation models is that they're trained on public information, or information that is widely distributed — I like to think of the entire internet as that kind of database. But there's a lot of information in private databases: information you've been collecting for your applications, or whatever it is you're doing, that the foundation models don't know about. Yet there should be a way to take that information in your database and use these foundation models to build a generative AI experience — to add a richer AI experience to your applications.

The example I like to give uses a technique that has become very popular, called retrieval-augmented generation, which I'll explain in a second — but I'll start with an example. When I first built this out, I was sitting in my in-laws' sunroom in Florida, and my father-in-law loves to collect all sorts of Florida tchotchkes: alligator heads, toucans, all sorts of birds. So I thought: let's say I had a store selling tchotchkes, and I took the example of a blue elephant vase, because blue elephants — I kind of like them.
So, okay: let's say I go to a foundation model and ask, how much does this cost in my store? And the foundation model is going to say: I don't know — I don't have access to the information in your store, I don't have access to that database. That's where retrieval-augmented generation comes in, which I'm just going to call RAG from now on, because I'll trip over those words. The idea with RAG is that you do have information in a private database that you can use to augment the model. In this case, I have a product catalog, I have pricing information, I know how much a blue elephant vase costs. And when that question comes in, I can take that information residing in my database and feed it to the foundation model to shape the response, so it can say: yes, a blue elephant vase costs about $20. The really cool part is that you don't need to train your own foundation model to get that experience — you can take the information in your database and augment with it.

Then the question becomes: great, how do I do that? And that's where the vector comes in. We basically need a lingua franca, a universal language, to move the information from the database to the foundation model and back. That's where the vector sits in the RAG workflow.

RAG happens in two parts. The first part: you take your raw information — it could be unstructured data, whatever — and create a numerical representation of it. Today, that numerical representation is the vector. That's the first part, down here: the example is that I have a bunch of PDF documents, let's say my store catalog. You have to do something called chunking, because the way these embedding models work is that they look over a subset of the information and assign a numerical representation to it. You put the chunks into the embedding model, and then you store the result in a vector database — or a database; we'll get more into that. So that's part one: it builds up the dataset to be used for RAG.

Part two: I have a customer, they want to know how much a blue elephant vase costs, so they come in and ask the question. The next step is to take that question and turn it into a vector, because you're going to do something called a vector similarity search in the database to find the things that most resemble your blue elephant vase. You'll see: oh, okay, I have a blue elephant vase, here's the pricing information; take the question, provide that additional context on price, send it to the foundation model, and get the response we saw before. That's part two.

The vector is important in both parts, because it links the unstructured information — whether it's the information about the products or the question coming from the customer — and ties it all together to deliver the answer to the user. So that's pretty cool. And — we'll do all questions at the end, please.
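To make the two parts concrete, a minimal sketch using the pgvector extension (which comes up again later in this talk); the table, the three-dimensional embeddings, and the literal values are all placeholders — real embedding models return far more dimensions:

    CREATE EXTENSION IF NOT EXISTS vector;

    -- part one: store the chunked catalog text together with its embedding
    CREATE TABLE product_chunks (
        id        bigserial PRIMARY KEY,
        content   text,
        embedding vector(3)        -- 3 dims to keep the sketch readable; models return ~1,536
    );
    INSERT INTO product_chunks (content, embedding)
    VALUES ('Blue elephant vase, $20', '[0.12, 0.80, 0.33]');

    -- part two: embed the customer's question outside the database, then pull the
    -- closest chunks to hand to the foundation model as extra context
    SELECT content
    FROM product_chunks
    ORDER BY embedding <-> '[0.10, 0.75, 0.30]'    -- distance to the question's embedding
    LIMIT 5;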
Thank you. So that's RAG in a nutshell, and it's a fairly simple workflow, but there are some challenges presented particularly by vectors. In the case of RAG and generative AI, the first is how long it takes to generate embeddings. These calls can take tens of milliseconds, hundreds of milliseconds, or — depending on how busy the model is — maybe longer. Okay, maybe that's not so bad if you're doing it as part of asking a question, but if you need embeddings in real time, you don't want to go through your whole dataset and generate them all at once; you want to be able to cache them somewhere.

The other thing is that they get quite large. A lot of the models return over a thousand dimensions — 1,536 here — and the individual dimensions of the vector are four-byte floats. That becomes big very quickly: one of these vectors is about six kilobytes. You're probably thinking, Jonathan, six kilobytes is not big — but imagine you have a million of them; that's already about 5.7 gigabytes to store. Again, you might say that's still not very big, I have PDF documents that are much larger — but think about it from a database standpoint: your average database row is nowhere near six kilobytes, it's probably closer to 600 bytes. So with a million of these you're paying a lot more for space. And you might say, well, we have compression, we can shrink things down — but not really for vectors, because these are a bunch of effectively random four-byte floats; there are no patterns to exploit. There are techniques — quantization, principal component analysis — that can make the vector smaller, but you lose information when you do that, and that's part of the challenge we'll discuss a little later.

All right, so storing them is already pretty challenging — but what about querying them? The primary operation for comparing two vectors is finding the distance between them; you might hear it called a similarity search. And there are two things going on here. If you want to find the vectors most similar to your query vector in a dataset, you have to compare against every single vector in that dataset. So if you have a million vectors and you're doing a similarity search, you have to scan all one million vectors. But wait, it gets worse, because when you do that there are no shortcuts:
you have to compute against every single dimension in the dataset. So if you have a million vectors, each with 1,536 dimensions, matched against a query vector, that's a lot of calculations. There are no shortcuts.

Well — there is one shortcut: something called approximate nearest neighbor search. The idea with approximate nearest neighbor search — which I'll abbreviate to ANN — is that you can find similar vectors in your dataset without searching through all of them. That sounds a lot like indexing from a traditional database standpoint, and it is a kind of index, because you are looking over a subset of the dataset. If you don't have to look at all one million vectors, and you only have to look at, say, a hundred to get your answer, that's way better — a much faster query — and it is much faster than what's called exact nearest neighbor search, which is the term we use for the traditional approach. But because you're not looking at every vector in the dataset, you can't be sure you're getting the most relevant results, and that leads to a measurement called recall, which is your percentage of expected results.

Take those ten orange blocks — sorry, ten orange circles — on the side. Let's say those are my expected results: for this query vector I expect to see those ten vectors. But I only return eight of them when I search the index; my recall is 80%, because I only got 80% of my expected results. For people who've been working with databases a long time, this is a very weird concept, because you're thinking: wait a second, I put in a query, I expect to get back exactly the results I stored — but you're saying I'm not getting my expected results. That personally tripped me up the first time I worked with pgvector; I thought, this is wrong, I'm getting the wrong results. But this is how approximate nearest neighbor search works: you're looking over a subset of all the data in the set and making a best guess. Some of these ANN indexing algorithms are very good, and you'll get a high level of recall — 95%, 99% — but you might not get the best answers, or all the answers you'd like to see. This is a very important concept to keep in mind as we go further into this journey.

So before diving into one of my favorite topics, a very good question to ask when you're looking at storing vectors is: does vector storage even fit into your workflow? Do you need a "vector database" to store the vectors, or is it something you could do entirely in memory — you could use a library like Faiss to calculate everything. That's one thing you do have to determine: you may not need to store vectors at all. But if you do, part of the math is: how much data am I storing? Is it hundreds of thousands, millions, tens of millions, billions? Because based on that, there are going to be all sorts of costs associated with it.
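The back-of-the-envelope math from a couple of slides ago, done in SQL (1,536 dimensions and one million rows are the example numbers from the talk):

    SELECT 1536 * 4 AS bytes_per_vector,                        -- four-byte floats
           pg_size_pretty(1536::bigint * 4 * 1000000) AS million_vectors;
    -- about 6 kB per vector, and roughly 5.7 GB for a million of them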
I mean, as we saw, these very large vectors have a payload — six kilobytes for a 1,500-dimensional vector is a lot. If you have tens of billions of those, you're probably going to look for cheaper storage. But if you have a hundred thousand, or 250,000, whatever it is, that's still not that much. Then from there you're going to decide what makes the most sense for you. Do I want something that's high performance? Am I most interested in a hundred percent relevancy, where I always need to do the exact nearest neighbor search? And ultimately, what is my budget? Once you understand which attributes are most important to you, you can make design decisions for your vector database — traditional things like schema design, which is my favorite thing when working with a database — and you can determine what indexing technique you're going to use, whether you need an indexing technique at all, and what your target query time is. Which brings us to Postgres as a vector store. This is really cool, because Postgres has a long history — I'm roughly the same age as Postgres — and Postgres has evolved through the years from being your traditional relational database to being a geospatial database, a time series database, a document database. And what's great is that now we can see Postgres as a vector database, in part because Postgres from the get-go was designed to be extensible: you can add functionality to Postgres without having to fork Postgres, and that's a really powerful concept. What's also really cool is that over the past several years Postgres added the ability to build your own custom indexing techniques, without having to leverage one of the existing indexing frameworks in it. This is very powerful, particularly for vector searches, because one of the reasons Postgres has become popular for vector searches is that it works with your existing client libraries. Again, I'll make a confession in front of a bunch of database folks: I liked using ORMs when I was an app developer.
It was very easy and convenient. And that's great, because now if I want these vector capabilities in my database, I can continue using my favorite ORM and I don't need to change anything — that's powerful too. Also, my AI/ML data is likely something I want to keep in the same database as the rest of my data, and having that co-location is powerful, because I don't need to store my data all over the place and suddenly have a distributed data problem. I can keep everything in one place, and that's very nice. And then you get Postgres with your vector database, because Postgres does a lot of things for you that you might take for granted. Being an ACID-compliant database is very important. I always like to focus on the D of ACID — the durability — knowing that if my database is writing data to disk, it's getting stored to disk. That's a very powerful property, as are the properties around visibility and how things ultimately get searched. So, just as a fun fact, Postgres has actually been a vector database since the get-go. The array data type was in the original Berkeley version of Postgres. Arrays were created to help with looking up access control lists within Postgres, so you didn't have to do a join to get those rules — basically, you could look at everything in one column. Now, the array data type is awesome: it's very versatile, it works with a bunch of the data types already within Postgres, and there are almost no limits with it, as I like to say. But it doesn't support distance operations. However, back in 2000 or 2001 the cube data type came along, and cube was basically able to store larger vectors of data — you could actually index up to a hundred dimensions — and with the built-in GiST index you could do exact nearest neighbor lookups very quickly. Just a side note: the GiST index is what PostGIS — the geospatial capability for Postgres I mentioned — leverages to do very fast exact nearest neighbor searches. Now, cube is great, but it's limited to a hundred dimensions, and back when I was doing machine learning, 20 dimensions was a high number of dimensions. So when I see 1,536, it's like, what is that?
That's a lot of dimensions. The fact is that we're limited, because with the things built into Postgres we're not able to handle these newer-size vectors. So this brings us to pgvector. pgvector is an open-source extension that gives you the capabilities you need to do vector search in Postgres. For one, it has a vector data type. It has a couple of indexing methods, which we're going to go into; it supports exact nearest neighbor search, approximate nearest neighbor search, and a bunch of different distance operations. And the best part is that you get Postgres: you can take your vector data and co-locate it with all the other data associated with your embeddings and have it all in one place. Now, I'm rushing over this slide because we're going to dive deep into all of these things during the remainder of the talk, but this is pretty cool — this is the power of the Postgres extension framework: if there's a new kind of data pattern or some capability that's needed, you can build it in an extension. And as we learned at the AMA yesterday, if there's something you can't do in an extension, and you need something exposed in Postgres directly to enable an extension, that's a great topic of conversation with the people contributing to and hacking on Postgres. One thing I do want to address real quick is understanding pgvector performance. There are a lot of different benchmarks out there — I could do a whole talk just on vector benchmarking — but one thing is that it is relatively fast, particularly with some of the newer indexing methodologies. This was one test I ran just to look at throughput; in this case I was actually trying to compare different processors. One of the powers of Postgres is that it can scale concurrently and it can scale vertically pretty well, and this does apply to vector data: if I throw more concurrent SELECTs at the data and continue to vertically scale the hardware, it can go pretty far. So that's one thing to consider — we're going to look at some scaling techniques throughout the talk — but it does perform pretty well, particularly when you're looking over a smaller portion of your vector index. Now, as we'll see throughout the talk, there is always going to be a tension between the speed of your queries and the recall of your queries. A lot of things factor into recall, but keep in mind: when you're trying to decide what your target recall is, that is going to impact the speed of your vector queries. So I'm going to take a quick drink of water — unfortunately I have the handheld mic, so please bear with me a second while I model how I can drink from a bottle of water. So, distance. Distance is very important, because remember, distance is the fundamental way you compare two vectors. pgvector offers three different types: there's Euclidean, or L2, distance, which is line of sight — think of that as me looking at you, you looking at me; there's cosine distance, which is angular distance; and there's inner product, which kind of captures elements of both. All three are used.
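To make that concrete, here is a minimal sketch of the pgvector basics — not taken from the slides. The items table, its columns, and the three-dimension vectors are made up purely so the literals stay readable; the three operators are pgvector's distance operators just described.

CREATE EXTENSION IF NOT EXISTS vector;

-- hypothetical table; only 3 dimensions so the literals fit on a line
CREATE TABLE items (
    id          bigserial PRIMARY KEY,
    category_id integer,
    content     text,
    embedding   vector(3)
);

INSERT INTO items (category_id, content, embedding) VALUES
    (7, 'vectors are the new JSON',      '[0.5, 0.5, 0.5]'),
    (6, 'postgres as a vector database', '[0.9, 0.1, 0.2]');

-- exact nearest neighbor search (no index yet); <-> is Euclidean / L2 distance
SELECT content, embedding <-> '[0.5, 0.4, 0.6]' AS l2_distance
FROM items
ORDER BY embedding <-> '[0.5, 0.4, 0.6]'
LIMIT 5;

-- <=> is cosine distance, <#> is (negative) inner product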
We've been seeing cosine distance be more popular with some of the larger embedding models. It's one of those things where you do have to try it out. Now, all of these have different mathematical properties — I'm trying to make sure I get through all my slides, so I didn't put in a slide comparing them — but one key thing when you're designing a vector database, or a vector indexing and search system, is that you want to cheat wherever you can. And when I say cheat, I mean reduce the amount of compute wherever you can, because these are costly operations. In particular, a lot of these operations involve division, and division is the worst of all the operations — to this day it's one of the most expensive compute operations. And that's where there's this trick: the unit vector. Basically, you scale the vector so its magnitude comes out to one; you might also hear this called normalizing a vector. When pgvector stores a vector in the vector column, the column stores the vector as you inserted it, but when you add it to the index, it's going to normalize it — convert it to a unit vector. The reason is that it lets you cheat when you do some of the distance calculations, because you can eliminate some of those divisions and, in some cases, some of the multiplication operations, which makes it a little bit faster — because remember, anywhere we can shave off CPU, the better. Another note I'll make right here: one of the challenges with vectors is that you kind of have to optimize everything in your compute cycle — CPU, memory, I/O. I like to say it's just a nasty data type, because there are so few shortcuts to speed things up that any shortcut you can get, just take it. And this is a popular technique: storing the normalized version of the vector. The other nice thing about normalizing is that you don't lose information. But if you were to visualize the vectors in space, you could see them get clumped a little bit too close together, and that becomes a bit more of an issue when you're dealing with indexing. Speaking of which: pgvector provides two different ANN indexes, IVFFlat and HNSW. IVFFlat is the cluster-based indexing method — we're going to look at both of these in depth throughout the rest of the talk. What's nice about IVFFlat is that it's relatively simple to understand: when you think about it, it's like, hey, I'm going to build clusters around a number of lists — define where my centers are and then cluster my vectors around them. These indexes tend to be very quick to build, but you basically have to have all of your data preloaded up front in order to build them — in order to create the correct centers. Okay, please hold your questions till the end of the talk.
Thank you. So, with HNSW — HNSW is a graph-based method. Think about traditional database indexing: you store things in trees. Graphs are just a superset of trees; I like to call them slightly more complex trees. The way HNSW works is that it organizes vectors into neighborhoods. The idea is that when I'm trying to place a vector in the index, I'm trying to find the vectors around me that are most similar — I might say, find me the 16 most similar vectors — and once it does that, you create links to all those vectors around your neighborhood. When you do the search, you trace these neighborhoods to find those strong clusters, and more likely than not you're going to find the vectors you're most similar to. We're going to see a demo of how this works. What's nice about HNSW is that it's an iterative index: you don't need to have your data loaded all at once to use it, you can keep inserting into it. But it does have a much higher build time than IVFFlat. So I'll give the CliffsNotes on which method you should choose. It's a little bit complicated, but there's one simple rule: if you need a hundred percent recall, you have to do an exact nearest neighbor search. Don't choose an index, because you can't necessarily guarantee you'll get a hundred percent recall with any of these methods. If you need fast indexing, you're going to choose IVFFlat — if you know you're going to need to rebuild your index constantly, if you're adding data in such a way that it makes more sense to just keep rebuilding and finding the new centers, use IVFFlat. The reason I put an asterisk there is that recent releases of pgvector have improved HNSW build performance significantly, so this is becoming slightly less true.
It's still true, but with some of the HNSW properties it makes more sense to use that. And just being honest, from an operational standpoint HNSW is a lot simpler to manage — it feels more like a "traditional" index. What's nice is that it gets this high performance-to-recall ratio, in the sense that you're able to very quickly find the vectors you're most similar to, and more likely than not they're the most relevant results. Most of the vector databases out there offer HNSW as the first available index. There are reasons to use both — this gets into a deeper discussion, and I've seen people implement IVFFlat in their production systems and be very happy with the recall and performance — so in some ways the answer is: test to see which makes the most sense for you. I'd probably recommend starting with HNSW because it's a little bit easier, but again, you have a choice, and I think that definitely helps, particularly because a lot of other factors go into recall beyond your indexing strategy — such as your embedding model. So let's go into some strategies and best practices. I divide them into five sections: storage, the two indexing sections, filtering — which, for us database folks, is the WHERE clause — and one that I actually worked on on the airplane over here today — sorry, two days ago — which is distributed queries, which is kind of exciting and scary. So, storage strategies. In order to understand storing vectors in Postgres, we need to understand TOAST. I could make the toast joke — no, it's not the thing you put in your toaster, although in some ways it kind of is. TOAST is The Oversized-Attribute Storage Technique. The reason TOAST is important is that by default Postgres stores 8 kilobytes of data within a page; the page is the atomic unit in Postgres that gets stored to disk, and that's one of the ways we're able to implement all the nice features around Postgres. But we have data that's larger than that — text being the first example — and TOAST is designed to store data that goes beyond 8 kilobytes. I believe you can store up to 1 gigabyte of data in a single TOASTed value — I think that's correct; if not, I apologize. By default, if you have data over 2 kilobytes, Postgres TOASTs it, which is going to happen quite frequently with these large vectors: a 510-dimensional four-byte-float vector will be toasted, and anything beyond that. So let's understand what happens when we TOAST it, because a lot of us have never had to think about TOAST — it just works — but now it's going to impact us. Postgres has four different column storage types, which impact how data gets TOASTed. The first is PLAIN, which basically says store the data inline — but keep in mind, if you use that, you're only allowed to store up to 8 kilobytes of data, which is roughly a 2,000-dimension vector. There's EXTENDED, where data is stored compressed in your TOAST table. This was the default in pgvector before 0.6.0, but keep in mind there's that magic word, "compressed" — I thought we said we couldn't compress vector data? And that's correct: we're basically wasting time trying to compress it, only to figure out that we can't — in fact, it often generates something slightly larger.
And so we're not going to compress it at all, which led to the adoption of EXTERNAL as the default storage type in newer versions of pgvector: it stores the vector in the TOAST table but won't attempt to compress it, so we save some CPU cycles — because again, every bit of CPU we can save, we want to save. And finally there's MAIN, which I mention for completeness; it's not super relevant to vectors. Now, why am I spending so much time on TOAST? Because it can impact your performance. Postgres has had query parallelism since version 9.6, and it's a wonderful feature, and if you have to do a lot of these vector computations in an exact nearest neighbor scan, you're going to use it. So I had a data set — I think it's a million; I can't remember if it's a million or 10 million, I think it's a million — and it's like, hey, do an exact nearest neighbor scan on a million 128-dimension vectors. Cool, I'm going to give you six parallel workers. Let's do the same thing on 1,536-dimension vectors, and it says: cool, I'm going to give you four parallel workers. That seems wrong, right? Wait a second — why am I getting only four parallel workers here? And that's kind of a gotcha, because these vectors are TOASTed. When Postgres is doing the query planning, the TOAST table is actually a separate table, so it does the query plan, looks at the main table and decides there's not that much data here — because all the data is in a different table — and it's not considering that in the query plan. So that's a gotcha, and we need to plan around it, because one of the big differences between this and how TOAST was originally used is that the vectors are in the hot path of your query — they're being used for a serious workload. There are some ways around this to make your life easier. One is to use PLAIN storage if you know your vectors won't go beyond two thousand dimensions — you can store them inline. Just make sure it makes sense with the rest of your queries, because that is a lot of data; the reason we use TOAST is to speed up queries on your main table. But if the vector query is the raison d'être of that table — the reason you have it — it may make sense to keep it in PLAIN storage. Also, right now pgvector only supports indexing up to two thousand dimensions, so PLAIN storage may make sense for that; we'll talk a little more about that point later. Or there's this parameter called min_parallel_table_scan_size, which can help induce more parallel workers. In that previous example, I kept my vectors TOASTed and set min_parallel_table_scan_size to 1, and look — 11 workers, a lot more, which is good because we can definitely benefit from more workers in this case. The reason I spent so much time on TOAST is, going back to my roots as an app developer, this is not something I would ever have thought about — most things in Postgres are set-and-forget and just work — but because this data is a little bit different, and because of its size, it helps to understand these techniques to make sure we're optimizing our performance. And there's probably a discussion to be had about how we make this simpler — it's a fair point, and it's something I'm personally interested in helping out on.
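As a rough sketch of those two workarounds — the column names are from the hypothetical items table above, and the settings are illustrative rather than recommendations:

-- keep the vectors inline instead of TOASTed (only viable up to roughly 2,000 dimensions)
ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;

-- or keep them TOASTed but uncompressed (the pgvector 0.6.0+ default)
ALTER TABLE items ALTER COLUMN embedding SET STORAGE EXTERNAL;

-- and/or coax the planner into more parallel workers despite the small-looking heap;
-- this value is measured in 8kB blocks, so 1 is deliberately tiny
SET min_parallel_table_scan_size = 1;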
So let's get into HNSW. HNSW stands for Hierarchical Navigable Small Worlds — say that ten times fast. As I mentioned, the idea is that it's a graph — and a graph is, sorry, a superset of a tree — and when we search, we search through a vector space where we start very sparse. The way I think of it — I thought about this the other day — is like a cake. We have a cake of vectors, but instead of vectors they're sprinkles. Say I design the cake so the top layer has just a few sprinkles, and going down, each layer gets more and more sprinkles, until the bottom layer has all the sprinkles. I'm basically tracing the sprinkles through the cake: I find the sprinkle I'm closest to here, then I go down — oh, maybe it's here — and so on. It's actually a very interesting algorithm; it's been around for about seven years, I think, and it basically helps place query vectors into the right space. The key with HNSW is your build: how you select your build parameters will ultimately impact how you're able to search. There are two key build parameters. The first is m, which is the maximum number of bidirectional links between indexed vectors — think of m as: if I'm standing here being indexed, I'm going to build links to, say, the 16 people closest to me in this room. ef_construction is effectively your memory as you go through building the index: you keep a list of all the vectors you're closest to, and ultimately you choose your closest 16. Think of it as your search radius — the larger the search radius, the more likely you are to pick the vectors you're most similar to. Now, there are trade-offs, and we're going to see them in a few slides. But instead of me trying to visualize myself as a seven-layer cake, let me show you in an animated way how building an HNSW index works. Say we have a bunch of vectors, and I have a vector I'm going to index — the orange one. During HNSW you first start at the top layer: you have something called an entry point, and then you find the vector you're closest to — in this case, this one. Then you go down to the next layer, where there are a few more vectors, and you find the ones you're closest to, and you might build some links along the way, because that ultimately helps the search. Then you get to the bottom layer, you might find a few more vectors you're closest to as you build those links, and there you go — you're indexed. Again, an oversimplified example, but that's effectively how it works. It takes some effort, because you're maintaining a list of the vectors you've seen via that ef_construction parameter, but when you do that work up front, the payoff comes at query time.
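A minimal sketch of what those build parameters look like on the hypothetical items table from earlier; the values shown are pgvector's documented defaults, not a tuning recommendation:

-- m = max bidirectional links per indexed vector; ef_construction = candidate list kept during build
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);

-- use vector_cosine_ops or vector_ip_ops instead if your queries use <=> or <#>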
When you query, there's one parameter you need to be aware of: ef_search. ef_search is similar — it's how many of those nearest vectors you maintain in a list as you search. The more of those vectors you maintain, the more likely you are to return highly relevant results, but there's a cost, because you're maintaining a larger list that you have to look over. That value must be greater than or equal to your LIMIT, by the way — if you have LIMIT 50 and ef_search is 40, you're only going to return 40 results, so you're not getting everything you asked for in the LIMIT. I think a newer version of pgvector put in a check or guard against that, but I don't remember off the top of my head. And search is very quick. The idea is that you start at the top layer, find your entry vector, and then find the vector you're closest to in that layer; go down to the next layer, same thing — find the vector I'm closest to; go down to the final layer, traverse the graph, and find the vector I'm closest to. Cool, very quick — it kind of feels like a B-tree search, and in many ways it is; a graph is just a larger tree. So, there are a few best practices, and I'll say these keep evolving — I think that's one key point with vector search: this is a very rapidly evolving field. Even since I first wrote this talk, some of the guidance has changed; you can see that in the second and third bullet points. The nice thing is we did a lot of testing on the default values, and the defaults usually work, though we're going to see that based upon a recent pgvector release, we can probably crank up ef_construction a bit. The other thing I want to mention is that not all HNSW is equal: different vector databases have different implementations of HNSW — some might use a library like Faiss to perform the algorithm and then store to disk. pgvector actually implemented HNSW itself, based upon the paper, and chose a different variant of the algorithm than some of the other implementations, in part to make sure it could be optimized for Postgres storage. Because what you'll see with HNSW query performance is that it depends on how many times you have to go to disk — how many times you have to retrieve pages from disk. The more things are clustered on the same page, the faster it's going to be, because that's less work overall as you search the data. That doesn't mean there aren't other techniques to speed things up, particularly for building indexes. With pgvector 0.5.1, you want to use the concurrent insert method: start with an empty index and then do concurrent inserts or COPYs, because that is way faster. pgvector 0.6.0 added the ability to do parallel builds, so you can preload your data and then use parallel workers to build the index, and based upon some early testing, that looks to be a little bit faster than the concurrent insert method.
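Assuming pgvector 0.6.0 or later, a parallel build is mostly a matter of ordinary Postgres maintenance settings; a hedged sketch, with numbers that depend entirely on your hardware:

SET maintenance_work_mem = '8GB';          -- give the build room; ideally the graph fits here
SET max_parallel_maintenance_workers = 7;  -- workers used for the parallel index build
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- at query time, the knob discussed above; the default is 40
SET hnsw.ef_search = 100;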
I will note that there was a commit earlier this week, targeted for pgvector 0.6.2, that's going to work much better for very large data sets. I was actually trying to build a billion-scale data set with something like 128 workers, and I noticed, oh geez, this is not going as fast as it could. So Andrew Kane, the primary maintainer of pgvector, found a locking contention issue within a few hours and solved it. By the way, Andrew Kane does an incredible job with pgvector — I just want to note that for the record. But yeah, generally the parallel build, if you're able to leverage it, should be a little bit faster. Now, this has also updated some guidance. As mentioned, with HNSW it's all about the build parameters you choose, and before pgvector 0.6.0 — for instance, on this data set, which was over a million roughly 1,500-dimensional vectors — we'd start seeing diminishing returns on recall as we cranked up ef_construction. One of the reasons the default was good enough is that you're already getting pretty good recall at that level — it's already 0.91; do you really need to spend a lot more time building the index to get 0.93? And this was a difference of hours. But re-running the same exact test using 64 parallel build workers — on, I think, a pgvector 0.6.2 build — notice that we still see the same kind of degradation: we're still only going from maybe 0.91 recall to 0.93. But the time axis changed significantly — we're talking about minutes, not hours. And that's big, because even if I'm still seeing the same diminishing returns, if I can measure the build in minutes rather than hours, that's an argument to say, hey, let me use ef_construction of 256 instead of 64, because I'll get slightly better recall and the cost of getting it is not that bad. That's why I say this is a rapidly evolving area, but we're definitely seeing performance gains, and it is helping us update the guidance. And in general, if you can build more accurate — the machine learning folks might ask why I use the word "accurate" here — more accurate indexes, it makes more sense. m is a parameter that helps with that as well, as we can see: this was the gist-960 data set, which most approximate nearest neighbor indexes don't do well with, but the m parameter — the bidirectional links — can help increase recall. Notice, though, the time penalty you pay — again, we're talking about hours. So you have to decide what's most important to you when you're building this: are you looking for better recall and willing to pay the upfront cost of building the index, or do you want the index built quickly and you'll deal with the fact that you might not get the most relevant results — though again, that level of relevancy may be fine for you. So that's HNSW in a nutshell. You have a few levers; the build is the most important one. You can always increase hnsw.ef_search to increase recall, but there's a tension — it will decrease the performance of your queries. And the more of your data you can fit into memory, the more that speeds up queries — the fewer trips to disk, generally the faster things are.
So at a minimum, try to keep your index in memory. Ideally, if you can keep both your index and your table in memory, that'll be the fastest, but at a minimum you want the index in memory, because that's going to be the fast path. So, I'm going to go through IVFFlat real quick. IVFFlat, as mentioned, is a clustering algorithm. It's basically driven by a number — they call it lists; other people call it centers; you can call it buckets — and the idea is that you've already loaded your data set, and you're going to find that many centers to group everything around. For example, say I have a bunch of vectors in space, like the previous example, and I say I want lists = 3: I'm going to find the centers and assign vectors to the lists. Then when you search over it, you have a value called probes. If you set ivfflat.probes to 1, that basically means you're going to search one center — and in this case I want to find the three closest vectors to me. If I search just one center, these are my three closest vectors. But if you eyeball this, you're going to say, wait a second, those aren't actually the three closest vectors — these are the three closest vectors. That's correct, and what you have to do to get a more relevant search is increase the number of probes; when you do, you actually find your three closest vectors. That's the trick with IVFFlat: you often have to increase your probes value to get the results you want, but it comes at a cost, and from what we've observed it tends to be more linear, versus the more gradual, logarithmic-like behavior you might see with HNSW. So there are some performance strategies: if you want to increase recall, increase ivfflat.probes, but it does decrease performance. I think that's been the general complaint with IVFFlat — that often, based on how things get clustered, you do need to increase probes, and that impacts performance. But here's the thing: you can actually have an IVFFlat index outperform an HNSW index, because if you can get your lists small enough and you actually get the most relevant clusters, you'll outperform HNSW — you get very, very quick queries. But again, the trade-off becomes: do you want to spend all your days tuning vector indexes, or do you want to build your application?
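Before moving on, a minimal sketch of those two IVFFlat knobs on the hypothetical items table; the lists heuristic in the comment is the one from the pgvector README:

-- load the data first, then build; lists ≈ rows/1000 (up to ~1M rows) or sqrt(rows) beyond that
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 100);

-- query-time knob: how many lists (centers) to scan; higher = better recall, slower queries
SET ivfflat.probes = 10;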
So it depends where you want to spend your time. You can also nudge the planner toward using the index by lowering random_page_cost — again, as an app developer I rarely looked at costing parameters in Postgres until I started looking at vector indexes. And you can set shared_buffers — sorry, not finally, but you can set shared_buffers — again, to keep as much in memory as possible. If you're looking over a lot of lists, you might want to increase the work_mem parameter on a per-query basis. work_mem is basically how much of your working set you keep in memory before it spills to disk, and if you spill to disk there's a penalty, because you're pulling things in and out of disk to perform operations. So say you have 10,000 lists and you set probes to a thousand — you might exceed Postgres's default work_mem and end up with that swapping between memory and disk. So, best practices. A lot of these I took from the pgvector repo, in terms of how you choose the value of lists — there's an art and a science to it. The thing to keep in mind with IVFFlat is that as you add and remove vectors from your data set, that can end up skewing your centers, which means you have to rebuild the index to get queries with the highest relevancy. But you can use parallelism to accelerate build times. Real quick: the way it originally worked in pgvector was that when you were doing the list assignment, you would do a sequential scan on the table — basically read every single vector one by one. Imagine you have a million, ten million, a hundred million — that's going to take a while. With parallelism you can break that up and have a bunch of parallel workers do the list assignment, which in this experiment — a million 768-dimensional vectors — gave me a 2x improvement, and sometimes I saw up to a 4x improvement. So again, speed is good: if you can build your indexes more quickly, generally that's a good thing.
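Before we leave indexing for filtering, here is roughly what the session-level knobs mentioned a moment ago look like; the values are purely illustrative and should be tuned against your own workload:

SET random_page_cost = 1.1;   -- nudge the planner toward index scans on fast storage
SET work_mem = '256MB';       -- per-operation working memory before spilling to disk
-- shared_buffers is a server-wide setting in postgresql.conf (restart required), e.g.:
--   shared_buffers = '16GB'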
Filtering. Filtering has been the hot topic for the past, let's say, three to six months for vector databases in general. And I kind of chuckle, with my database hat on, because this is what Postgres does: you put a WHERE clause on something and you filter things out. But it gets a little bit weird with vector queries — like everything with vector queries, it gets a little bit weird. If I add a WHERE clause, the first thing that might happen is Postgres chooses not to use the index: fine, I'm going to do a sequential scan. So suddenly, when you thought you had a query returning in milliseconds, it's returning in seconds, because it scans the entire table, applies the filter, and then you get the result of your query. Or it might end up using a vector index, but you might not get enough results, because your index is over your entire data set and it's doing the search based on which vectors you're closest to, before the filter. Those vectors might not have the filtered attribute on them — in this example, if your category_id is 7 and I return a bunch of vectors where category_id is 6, all of those get filtered out and I don't get the results I want. Or the filtering may occur after using the index. There's actually one more case I need to add here: you might end up using a different index to do the filtering — like a B-tree on category_id — and then doing the exact nearest neighbor search afterwards, which might actually be totally fine, because if you get, say, a hundred rows back, just do the exact nearest neighbor search over them; that's going to be super fast and you'll get the relevant results you want. I guess I got ahead of myself — the first point is that a B-tree might be the way to go here; you might not need an ANN index, and frankly, if you can use a B-tree, that's much simpler. You also need to consider how many rows the filter removes: if your filter leaves you with 50,000 rows, and you'd need to do 50,000 distance comparisons at that point, you probably want to be able to do an approximate nearest neighbor search. And of course it always goes back to: do you want exact results or approximate results? That's on you to decide for your application.
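To make the filtering problem concrete — and to preview the partial-index strategy described next — a sketch against the hypothetical items table; category_id and the vector literal are made up for illustration:

-- the query we actually want: nearest neighbors within one category
SELECT id, content
FROM items
WHERE category_id = 7
ORDER BY embedding <-> '[0.5, 0.4, 0.6]'
LIMIT 10;

-- option 1: a plain B-tree on the filter column, then exact distances over the survivors
CREATE INDEX ON items (category_id);

-- option 2 (the pre-filter strategy discussed next): a partial ANN index scoped to the filter value
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
    WHERE category_id = 7;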
So there are two filtering strategies to consider today. There's the partial index — another great Postgres invention — where we basically index over a subset of your data. That gives you an effective pre-filter: I'm looking only at the rows assigned to category_id = 7, I can do the approximate nearest neighbor search, and I make sure I'm seeing all the relevant results I want to see. Or I can partition my data, if I have a clear partitioning strategy. But be careful how you partition your data and do an approximate nearest neighbor search, because if you have multiple values that match your filter within the same partition, you run back into the same problem we mentioned before: you're searching over all the vectors in that index, and if too many of them get filtered out, you might not actually get the rows you want. There is a patch — actually several patches — proposed for pgvector right now to do the equivalent of multi-column indexing, which is another kind of pre-filtering and might simplify this from an app developer standpoint. Not committed yet; we'll talk about that in a second. There's definitely more work to do on this — I'd say it's still one of the areas being rapidly innovated on for vector databases in general, not just pgvector. So, last but not least — I call this "looking ahead" a little bit — distributed pgvector queries. I'm of the personal opinion that this still scales pretty well vertically on Postgres, but there may come a day when we need distributed queries: effectively, if you don't have enough memory on a single instance to efficiently do your vector searches, you might want to push that work down to other instances and then do a final aggregation on a single instance. When you get to that point there are a few things to consider: the memory issue I just discussed, but also your network overhead must be acceptable, because you are going to pay some latency. And you have to be able to manage a multi-node system, which is not trivial — especially in production you need to worry about availability, backups, monitoring, software upgrades, and so on. There are a multitude of ways to do that, but as I like to say, more databases, more problems. Now, Postgres does make it possible to do this out of the box using something called foreign data wrappers. So here's an example where I do the setup for a foreign data wrapper to query against multiple vector databases — sorry, multiple instances that contain vectors. There's a whole discussion on security here, which is probably a talk in itself — in fact, I gave a talk on that probably ten years ago.
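This is a hedged reconstruction of what such a setup might look like, not the actual slide; the host, database, table, and credential names are all made up, and it includes the foreign-table piece described next:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER vector_shard_1
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'vectors',
             extensions 'vector',     -- tell the planner pgvector operators are safe to push down
             async_capable 'true');   -- allow asynchronous foreign scans

CREATE USER MAPPING FOR CURRENT_USER SERVER vector_shard_1
    OPTIONS (user 'app', password 'secret');

-- the foreign table: a local reference to the remote items table
CREATE FOREIGN TABLE items_shard_1 (
    id        bigint,
    embedding vector(3)
) SERVER vector_shard_1 OPTIONS (table_name 'items');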
So, skipping that here. Then you have to set up what's called a foreign table, which is a reference to your remote table. Now, what I didn't show you is that I actually set up these tables on the different instances, and each of them had about 2.5 million vectors in it, so there's an additional step here. But I wanted to show the layout — sorry, I should have gone back: there are two key things to add when you're setting up the foreign server. You're saying, hey, these foreign Postgres instances have the pgvector extension loaded, and they're capable of asynchronous queries — because you don't necessarily want to query each database one by one, you want to query them all at once and get the results back. So I did that, and you can see: hey, I'm actually able to do the foreign scan. But I'd say this is hot off the presses, because I'm still tracing something down: I wasn't getting an asynchronous foreign scan, so you can see, based on the timing, that it was cumulative when I did the final search. The other thing I'm testing right now, which is looking very positive, is whether this impacts recall, and the answer so far seems to be no — you can do this and still maintain your recall targets. But ask me later in the year, possibly in May at PGConf.dev 2024. The thing I'm tracing down is whether we can get the async foreign scan — it might have to do with an Append node versus a MergeAppend node, but to be determined. So, looking ahead: what's on the roadmap for pgvector? As I mentioned, performance improvements for massively parallel HNSW builds — that has been committed, and I'm very much looking forward to it. There's the efficient pre-filtering — to be determined what method is used; there's a patch for HQANN, to be determined whether that ends up going in, but again, pre-filtering is the hot topic right now. Allowing more data types per dimension — two-byte floats, one-byte unsigned integers — and what's cool is that there are proposals for how to do this in pgvector right now. That allows you to do something called scalar quantization with expression indexes, where you have your full vector stored in your table's vector column and then cast it to, say, two-byte floats in your vector index. The powerful thing is that it lets us expand the number of dimensions we can index with pgvector — pretty cool, excited for that. Another technique that expands on that is product quantization, again being explored. What we found with pgvector was that things kept scaling vertically fairly well, so it didn't necessarily make sense to add the quantization techniques yet, based on the performance numbers we saw — and frankly, the comparable performance numbers we saw. And then there's parallel query: the nice thing about HNSW is that we really don't need parallel query; IVFFlat could benefit from it; for HNSW, to be determined — particularly as you expand ef_search, parallel query could have some benefits, but we haven't really seen much there yet. So, in conclusion: like JSON, a vector is just a data type. There are all sorts of different storage and search techniques to consider around it, but fundamentally it's a data type, and traditional databases like Postgres are very good at doing these things with these data types as you add vectors into your applications, should you so choose.
There's always going to be a design decision between your query performance and recall — and I should probably add your index build time. All of those are in tension, and you're going to have to choose your trade-offs. As far as I know, you can't get everything — you can try, and maybe your data set is small enough that you do get everything — but as your data set grows, you will have to make trade-offs. And you have to decide what you want to invest in. Is it at your storage layer — do you need super fast storage, or do you want something a little bit slower based on the pure size and volume of your data? Do you want the largest compute instance, to run super fast, massively parallel queries and index builds, or are you going to keep something smaller and live with the fact that you need to make fetches to disk? Then you have to pick your indexing strategy — what makes the most sense? Right now HNSW is seemingly the default people are choosing, but you have to decide what makes the most sense for you. And finally, this is rapidly evolving, so you have to plan for today and tomorrow. pgvector is pretty mature — I've already seen plenty of production deployments; I've also seen a lot of kicking the tires, which is completely fine, because I think people are kicking the tires on generative AI in general — but you do need to plan for the fact that it's rapidly evolving. I think the foundational performance features are in pgvector, minus maybe one or two, and from there it really is off to the races; I think we're going to see even further innovations from here. So, thank you, and we have a few minutes for questions. Any questions? "Are there significant differences between Postgres as opposed to, say, Chroma or Pinecone that are worth mentioning?" So, I'm not going to try to do an apples-to-apples — or apples-to-oranges — comparison between the different databases, but I'd say this about Postgres: Postgres has an almost 40-year development history, really starting with how do I store the data and making sure I can store it safely; it's ACID-compliant; and there's a whole ecosystem of tooling already built up around Postgres. The other thing is you get everything else with Postgres — if you want to build a geospatial, time series, vector database all in one, that's something you can do, just based on all the Postgres extensions that are out there. "Hello. Can you please give an example of how to convert attributes — let's say, in an e-commerce site, of a product — into the dimensions? And also, do all of the vectors need to be the same size, with the dimensions in the same order? What happens if some attributes are missing from certain records?"
So, I didn't understand the first part of the question, but for the second part: when you build an index, all the vectors within the index have to have the same dimensionality. You can store vectors of mixed dimensions within a Postgres database, even within the same column, but when you build the index you have to make sure all those vectors are of the same dimension. "The first part was: you showed in the example how you have dimensions of 0.5, 0.5, 0.5 — how do you get to those numbers from the actual record you're trying to compare?" So, is the question whether you can introspect those numbers directly, or how I made up that number? Yeah, that was just a contrived example. What happens with the generative AI models is that I have some text — "vectors are the new JSON" — I take that, put it through the embedding model, and it generates the attributes. How it does that is basically down to how the embedding model is defined — that's on it. Just one final question — and I can answer questions outside afterwards. "I was just wondering, with your RAG diagram: when someone queries the database, the diagram also inserts back into the database. When you insert new data, do you have to rebuild everything for the vectors?" So, no. With both IVFFlat and HNSW you don't need to rebuild everything when you insert, and in fact with Postgres it's an online insert: as soon as you insert the record into the table, it automatically gets inserted into the index. Now, IVFFlat and HNSW have different properties around inserts. Because HNSW is iterative, it's kind of like inserting into a traditional B-tree index: you insert it, and the vector goes into the highest-probability space of where it belongs. With IVFFlat, because it's cluster-based, you insert into a cluster, but over time your centers are going to skew — the centers you have represented are no longer necessarily the true centers — and in that case you might want to rebuild in order to get better results. Thank you. Thank you, everyone. So the next talk is about Postgres 17, in this room, in 15 minutes.

Yeah, is it on? Yes. But yeah, we're going to stand there and hold it — you can stand there and hold it. Okay, I think it's time to start. Okay, thank you, I'm done. Well, at least this microphone works — those who saw the previous talk know there were some challenges with the other one. But there should be some technology challenges, so I'm expecting my slides to give up about halfway through — no, just because they can, just because the possibility exists. Well, welcome here, all of you. My name is Magnus Hagander. I work for a company called Redpill Linpro; we're an open-source company in Scandinavia, based in Stockholm, Sweden. I'm part of the Postgres core team, and I am one of the committers — we'll see, I don't think I have a feature of my own in Postgres 17 yet, but it's not too late — and I currently serve as president of the board of PostgreSQL Europe. But we're here to talk about Postgres 17, right? Who's already using Postgres 17? Okay, good. Who's using it in production? Very good. How's it working out for you? Perfect. So yeah, Postgres 17 is at this point still the upcoming version of Postgres. The current latest and greatest that's supposed to be used in production is Postgres 16.
Who's on Postgres 16 in production? Okay. Let's see — 15? 14? So far the numbers are going down; I'm worried about when this trend is going to change. 13? 12? 11? Okay, now we're getting into troublesome territory. 10? Earlier than 10? Okay — well, I don't need to tell you that you should fix that; you probably know that already, and if it were easy you would have done it. But yeah, as of today the oldest supported version of Postgres is 12, and the newest supported version is 16. 17 is not yet supported, but 17 is the upcoming version. The way we work in Postgres is that we actually started Postgres 17 in June of last year, by branching Postgres 16. Development in Postgres works like this: we do all the new feature development against the master branch, and then when we feel it's nice and ready and stable, we create a branch — in this case called REL_16_STABLE — and that's how you magically make your software stable: just create a branch with "stable" in its name and you should be good to go. But jokes aside, what that means is that in June 2023 we created the branch called REL_16_STABLE, and from that moment you're not allowed to put any new features onto that branch anymore — it's bug fixes and maintenance only — and the master branch, the main development branch, gets opened up for building new features, and we start adding the features that will eventually bring us to Postgres 17. Then, in Postgres, we work with something we call commitfests. It's really just our way of doing iterative development — you can call it whatever you want, but we chose the name commitfest because nobody else had used it, and of course we wanted our own name. The idea is we spend one month building and contributing features to Postgres, then we spend one month reviewing and committing them onto the main branch. So the first commitfest for Postgres 17 was in July — covering things built in June, if they made it all the way through — then we had another one in September, another in November, one in January, and we are currently in the middle of the fifth commitfest, which is for March and which will be the last one. That's how we do it: these two-month cycles until March, and then we just stop. At the end of this commitfest 5, which is currently in progress — when it finishes, that's the feature freeze for version 17. "It's a very long commitfest — we're going to get so many features in." Yeah, I probably did that — because I kind of copied the slide and added one to everything. So no, it is March 2024.
Yes. And I think we typically set the official feature freeze for 17 about a week to ten days into April. Building on that, we'll stabilize things and eventually get to a beta version, usually around May-ish. We'll see how many beta versions there will be, and when we get to a release candidate, but the general target is to get a release out in the September-to-October time frame, as usual — as you know if you've been around Postgres for a while, that's when we usually try to get the new major versions out. But that also means, as a disclaimer, that where we are today may be completely different from where we are three weeks from now. We know that nothing will get added that hasn't already been submitted, but there are hundreds of things in the queue, and we don't know which of them are going to make it in. So this will be a preview, and the same thing holds right up until release: any of these features can be removed if it's determined that it's not good enough, or it runs into a corner case that cannot reasonably be fixed within the time frame of the release. This has happened a few times — it happened in the last version that some things were removed after the beta was out, and some of them are back now in 17. So, the almost-current status — I think this number is from yesterday; for some reason there's no freeze around when I do these presentations, so it's hard to keep up, and I updated this sometime last night, not this morning — but at that point we were looking at, since as we all know these are the best ways to measure developer productivity, about 1,600 commits, with about 150,000 lines of new code and about half as much deleted. So what kind of features are we looking at? For those of you who've seen my presentations before, I usually divide them into a couple of areas: DBA and admin; SQL and developer; backup and replication; and then we'll finish off by talking a little bit about performance, because everybody loves performance, and of course every Postgres version has some interesting stuff there. But let's start with breaking changes. What are we breaking for you? Hopefully not that much, but probably something — actually, I'm willing to say hopefully something, because if you're not affected by any of this, then you are not doing the things you should be doing. But probably not the first ones. For example, we have a different way of doing Windows builds now: you can no longer use our manual, really ugly, hacked-up scripts — you have to use Meson. That's a good thing. We dropped AIX support. Who uses Postgres on AIX? Who has ever used Postgres on AIX? Who has ever used AIX? Okay, there we go — at least something. And we've also removed — this is one of those double negations — we've removed the ability to turn off thread safety. I don't think anyone has turned off thread safety for many, many years. So these things should hopefully not mean that much to you. A couple of other things you may have been using: the adminpack contrib module is gone. All the functionality in that module already exists in Postgres core, but if you had code or modules that were using it, you just have to change them to use the functionality that's already in core. Has anyone ever used db_user_namespace?
It's a configuration parameter that you can turn on. Okay, I'm glad. Basically the idea behind it was that you could have different users in different databases in the same instance. The minor drawback was that you could no longer have encrypted passwords at all; everything was clear text. So I'm glad you didn't use it, and that's why it has been removed. Did anyone use the old_snapshot_threshold feature? Did you have a good experience with it? Yeah, it had far too many quirks, it didn't really work very well, and it was basically getting unmaintained, so it's gone. Okay, these are things I was hoping you weren't using. Has anyone ever read from pg_stat_bgwriter? I hope all of you have done that, maybe not directly, but via your monitoring tools. A few things are gone: the columns for timed checkpoints, requested checkpoints, checkpoint write time, checkpoint sync time, buffers written at checkpoint, buffers written by backends and backend fsyncs are all gone from this view. So all your monitoring software will break, sorry. What that means, of course, is that you need to upgrade your monitoring software to a version that supports 17, because most monitoring software reads from this view and it's vastly different in this version. The other thing that I hope is not going to break for you, but it might, is that the search_path variable is now by default secured in all functions called in maintenance operations. What the hell does that mean? Well, there are a few other cases this covers, but the simple one is an expression index: you have an index on the result of a function. Now, by default, when that function is executed Postgres will reset the search_path to contain just pg_catalog. If you need that function to be able to call other functions that are not in pg_catalog, you have to explicitly set the search_path on the function, otherwise it will stop working, because it will not find those functions. Previous versions of Postgres would use whatever search_path you were in, at which point the maintenance operation could very easily become a security footgun: you could basically allow code execution as other users by placing things in different schemas. So the hard-coded behavior is that if you don't do anything, everything becomes secure, because it's restricted to pg_catalog, and a regular user can never put functions in pg_catalog. But it also means that if you are using an expression index on your own functions that call out to other functions, they will likely break, and you just need to investigate it. Unless you're doing something really dumb, which you shouldn't be doing, all you need to do is ALTER FUNCTION ... SET search_path and it will work again; you've just changed what the hard-coded default is.
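A minimal sketch of that fix, assuming a hypothetical function app.normalize_email(text) that an expression index depends on (the schema and function names are mine, not from the talk):

```sql
-- Pin the search path for a function used in an expression index,
-- so maintenance operations (REINDEX and friends) can still resolve what it calls.
ALTER FUNCTION app.normalize_email(text)
    SET search_path = pg_catalog, app;
```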
So those are basically the breaking changes I've found so far. Let's talk about the new stuff instead; you didn't all come here, and you don't upgrade to 17, just to get things broken, right? A little bit. I mean, some things are actually good, some of these breaking things are good, but in general, no. So let's start with the area of DBA and administration. We'll start with a small one that I think the people who use it are going to like a lot. Let's start by: who's using psql? Surprisingly few, actually. Do you know that in psql you can do \watch, which is just like watch: run this query over and over again and show me the result? The new thing is you can give it a parameter that says when to stop, when the row count drops below a value. So if you run your query and do \watch with min_rows=2, it'll run until the number of rows returned is less than two and then it'll stop. It's kind of neat if you're polling some sort of background job. Who's ever done that, polling pg_stat_progress_vacuum with \watch, the vacuum finishes, and then the interesting data starts scrolling off the screen and out of your scrollback buffer before you notice? I've certainly done that. Small thing, very useful. Another thing that I think is going to be quite useful, but fairly small, is that there is yet another timeout. We've had many timeouts in Postgres: you have statement_timeout and session timeouts and idle_in_transaction_session_timeout, and now we have transaction_timeout. It is really what it says: if you set transaction_timeout to one minute, no transaction is allowed to run for more than one minute, regardless of what it's doing. A statement_timeout applies to the individual query, but with this one you can say one minute for a transaction, whether it's a million queries or one query; we don't care, stop after one minute. The typical use case is the classic web app: you've got a web request coming in, you're going to time out that web request at some point anyway, and then the queries keep running. Setting a statement_timeout for that is really hard because you might have multiple queries, but now you can just match this timeout to be a second or two longer than your web server timeout, and it'll clean up the old queries when there is not going to be anyone listening for the result anyway. [Q:] Can you define this on a per-user basis? Yes, like the other timeouts. Per session? Yes, you can change it pretty much everywhere, in the same places as statement_timeout.
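Roughly what those two look like in a psql session; the \watch option spelling and the exact timeout value are my reading of the 17 syntax, so treat this as a sketch:

```sql
-- Poll a running vacuum every 2 seconds and stop once the row disappears:
SELECT pid, phase, heap_blks_scanned, heap_blks_total
  FROM pg_stat_progress_vacuum;
\watch i=2 min_rows=1

-- Cap whole transactions, not individual statements:
SET transaction_timeout = '65s';  -- e.g. a bit longer than the web server timeout
```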
Who's using event triggers in Postgres? Okay, not too many. So event triggers: normal triggers fire based on data, event triggers fire based on, well, events. You can have them fire on things like CREATE TABLE, you can have them fire on DDL and log it or prevent it. At the simplest, you can now also create a trigger that fires whenever REINDEX runs and make things happen. The more debated one is that you can now also have a login event trigger, which means you can get a trigger that runs when a user logs in to the system, or as we like to call it, a footgun extraordinaire. You need to be very careful if you write one of these, because if your event trigger crashes, for example, nobody can log into your database, including you, so you can no longer fix your event trigger. Well, actually you can, because there is also a global parameter that globally turns off all event triggers, so that you can get in and fix your login trigger. You can do classic things like: if the event trigger writes anything, oh believe me, nobody will ever be able to log into one of your standbys again, because the event trigger runs, it tries to write, you can't write anything on a standby, so it crashes. So you can do very good things with login event triggers and you can do very bad things. It is a big footgun, but used right it can be very powerful. You can get connection auditing at a level where the built-in one isn't enough, you can add policies for who can log into what, you can assign roles at login; there's a whole bunch of things you can do, but be very, very careful, because if something goes wrong it goes wrong for real and you're stuck. Hopefully everybody by now has learned about wait events; if you went to Jeremy's talk yesterday he had a good example of why you should be looking at them. They've been around in Postgres since version 10, we renamed all of them a couple of versions ago just to confuse you, and only now did we actually create a new view called pg_wait_events that will tell you what they are, so when we rename them the next time you can find them. It's really just the event type, the name and a small description, so you can look them up, obviously for the version that you're on. One other important reason for this view is that extensions can now register custom wait events. Previously there was a single wait event for extensions that just said it's waiting on an extension, but as of 17 an extension can register its own wait events, which will then show up in pg_wait_events, so you can see what it's waiting for inside the extension as well.
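To make the footgun concrete, here is a hedged sketch of a defensive login event trigger (the role name and the policy are invented for illustration), plus a look at the new view:

```sql
-- Hypothetical policy: the 'batch_user' role may not log in directly.
-- The pg_is_in_recovery() guard keeps the trigger quiet on standbys,
-- per the warning above about triggers that misbehave there.
CREATE FUNCTION deny_batch_user_login() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF NOT pg_is_in_recovery() AND session_user = 'batch_user' THEN
        RAISE EXCEPTION 'batch_user may not log in directly';
    END IF;
END;
$$;

CREATE EVENT TRIGGER block_batch_logins
    ON login
    EXECUTE FUNCTION deny_batch_user_login();

-- Escape hatch if it ever locks everyone out: restart with event_triggers = off.

-- The new catalog of wait event names and descriptions:
SELECT type, name, description FROM pg_wait_events LIMIT 5;
```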
Another thing that I always like about Postgres, and someone mentioned it in the AMA yesterday, is the statistics features. There are always good new statistics features in Postgres, we try to bring them in every version, and we have new interesting parts of the statistics system in 17 as well. I'll bring this one right back: we also broke this, as I just mentioned, by removing a whole bunch of columns from pg_stat_bgwriter. When you think about it, none of those statistics are actually about the background writer, and that's why they were removed. Instead we have pg_stat_checkpointer, where we put them back. So it's not that we took the numbers away; we just moved them, so you can't find them. The other, more important part is, as you can see in this example, you've got the numbers of timed and requested checkpoints and the timings, and restart points now get separate statistics. Restart points are, simplified, the checkpoints that run on a standby or during startup; how many of those are happening is now split out into separate values. The old background writer statistics are still in the pg_stat_bgwriter view, and the checkpointer statistics have basically been moved over to pg_stat_checkpointer, so if you have monitoring tools that use these, that's where to go. Most of the other things that were removed, the buffer columns, moved into the pg_stat_io view, where they were added in 16; they just weren't removed from the bgwriter view at the time. We also have some additions to pg_stat_statements, which I'm hoping you're all using, or have a really good reason not to. One of the misses was that you could not get the local block I/O statistics per query, the local buffers in the backend, like temp tables and such; we didn't have the I/O timing for it. Well, now you have those: local block read time and write time. You also get the entry time, when a query was first recorded by pg_stat_statements in its current entry, as stats_since. This matters because, as you know, pg_stat_statements tracks a certain number of statements, and a query might leave the statistics and come back because it's no longer in the top five thousand or whatever you have configured. It will have the same query ID, but it will get a new stats_since, since at that point the old statistics are lost. There is also minmax_stats_since, which is related to the fact that you can now choose to reset all the statistics for a query except the minimum and maximum times. If you keep the minimum and maximum separate from the other statistics, these two timestamps can end up with different values; if you're not explicitly doing that kind of reset yourself, they will always show the same thing. The query jumbling, as it's called, in pg_stat_statements now properly normalizes CALL. If you're using actual stored procedures, in previous versions (you can see I didn't have a stored procedure in my test database, so this is just a dummy) every CALL statement with different parameters was seen as a different statement. Now it can identify that the parameters are actually parameters and normalize them to $1, $2, which can matter a lot, because otherwise half of your pg_stat_statements can rapidly become just different versions of a CALL statement. pg_stat_progress_vacuum now shows index progress. When you are vacuuming a table it won't tell you how far along it is within an individual index, but it will tell you, in this case here, that there are five indexes and three are done. It basically takes the index-vacuuming phase and gives you some more detail as it's running, a bit of a clue about how much longer you're going to have to wait, because usually when you're at the point of looking into this, you are waiting: it was slow enough that you felt the need to look, so this can be a nice further input into that.
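For reference, a sketch of what querying the reshuffled views might look like; the column names here are from memory of the 17 catalogs, so double-check them against your build:

```sql
-- Checkpointer activity now has its own view, with restartpoints broken out:
SELECT num_timed, num_requested,
       restartpoints_timed, restartpoints_done,
       write_time, sync_time, buffers_written
  FROM pg_stat_checkpointer;

-- pg_stat_statements: local-buffer I/O timing and entry age per statement:
SELECT queryid, local_blk_read_time, local_blk_write_time,
       stats_since, minmax_stats_since
  FROM pg_stat_statements
 ORDER BY local_blk_write_time DESC
 LIMIT 10;
```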
Okay, leaving statistics, let's talk about COPY. Who's ever seen an error somewhat like this? You copy your little dummy table from a test file in CSV and it goes boom: invalid input syntax for type integer. Usually it's not on line two; usually you've loaded a few million lines and then it happens. COPY in version 17 has gained the ability to detect and act on errors. What you can say here is COPY dummy FROM ... WITH (FORMAT csv, ON_ERROR ignore), and as I copy it now it just says that one row was skipped due to data type incompatibility. In 17 the only ON_ERROR operation supported is ignore, but the idea for the future is to be able to do things like on-error log, or write the bad rows to a different table where every column might be text, to let you work through them. In this version all we can do is ignore, but depending on what your data is, that's a good start. The goal, hopefully in Postgres 18, is to be able to actually send the incompatible data somewhere. [Q:] Any error? Yes, any error that's based on the source data; if you run out of disk space, you're screwed, but wrong number of columns or bad values, all of that type it will be able to ignore. Yes, Jimmy? [Q: what if one line breaks the rest of the file?] I think that is, in the end, going to depend on how it breaks: if you get a runaway quoted string or something, all bets are off and you're probably just going to fall into a different error later. But with single-line CSV, one line tends not to break the next one, except possibly for quoting errors, and those are going to go weird whatever we do. [Q:] No indication of which line? No, because then you'd have to log every one of them, and there might be millions of entries; that belongs to the upcoming ability to log each bad line, which we don't have at this point. Okay, speaking of things that were in Postgres 16, and then were not in Postgres 16, and now are back: a new permission called MAINTAIN. You can grant this to allow somebody who is not the table owner to do things like VACUUM, ANALYZE, CLUSTER, REINDEX, REFRESH MATERIALIZED VIEW and LOCK TABLE. Why was this removed in Postgres 16? Because we did not have the feature that secures the search_path for expression indexes, and in that case this could have been used to gain higher privileges in the database and be exploited as a security issue, so it was removed. The search_path lockdown was properly put in, and then this was put back, now that it can be done in a safe way. It works the way you'd expect: you can GRANT MAINTAIN on a table to a user and they get the permission on that individual table, or you can grant the new role pg_maintain to a user and they get it on every table. So you can create a special user that can run all those lovely vacuums for you.
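Both of those look roughly like this; the table and role names are just placeholders:

```sql
-- Skip rows that fail input conversion instead of aborting the whole load:
COPY dummy FROM '/tmp/dummy.csv' WITH (FORMAT csv, ON_ERROR ignore);
-- NOTICE:  1 row was skipped due to data type incompatibility

-- Let a dedicated role run maintenance without owning the tables:
GRANT MAINTAIN ON TABLE dummy TO vacuum_runner;  -- one table
GRANT pg_maintain TO vacuum_runner;              -- or every table
```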
Who loves dealing with locales? Okay, cool, let's add another one. Well, it's not actually another one, but Postgres 17 will come with a built-in locale provider. Previous versions of Postgres, for a few years at least, have had the libc provider and the ICU provider, and now we have a built-in one, so you get a little more to choose from. This is only for the C locale for now; Postgres will just bring its own handling of the C locale. It is faster than using ICU, it is faster than using glibc, and the most important part is that it is stable, or at least, if it's broken, then we broke it and we can fix it within Postgres. For those who don't know, the C locale is the one that basically is not a locale: it doesn't sort things correctly, but it does sort things consistently, and a lot of the time that's enough. If you're just indexing text and looking it up with equality lookups, it doesn't matter which order the index is in as long as it's consistent. And that's exactly the problem with, in particular, glibc locales: they're not necessarily consistent. Who has been hit by the locale problems in glibc upgrades? I feel your pain, I've done it too. Who has never heard of the locale issues around glibc upgrades? Very few people. Who thinks they're not affected by them? Okay, it's almost a trick question: if you think you're not affected, or if you've never heard of them, you should really go to Joe Conway's talk right after lunch here; he will talk about those problems in detail. The built-in locale provider is not a perfect fix for this, it's a step on the way, it's infrastructure, but it can be used already, right now. Okay, let's move on and take a look at some of the more SQL-exposed, maybe a little more developer-oriented things. The thing everyone's always been waiting for, right: you can now actually convert things to binary and octal. Sweet, isn't it? I think it was 16 that added the ability to specify literals in binary and octal, but there was no way to go the other way; well, now there is. Very simple, but still useful. If you are dealing with timestamps and intervals, we can now have infinite intervals. This was a little bit weird, because we could have infinite timestamps but not infinite differences between two timestamps, so if you compared two timestamps and one was infinite you would get an error. Now if you do infinity minus now(), you actually get infinity. It's kind of obvious, and it's kind of annoying when it crashes instead.
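A few of these, sketched out; the CREATE DATABASE option names for the built-in provider are as I understand the 17 syntax, so verify against the docs:

```sql
-- Use the new built-in provider for a C-locale database:
CREATE DATABASE appdb
    TEMPLATE template0
    LOCALE_PROVIDER builtin
    BUILTIN_LOCALE 'C';

-- Integer to binary/octal text, finally the other direction:
SELECT to_bin(42) AS bin, to_oct(42) AS oct;   -- 101010, 52

-- Infinite intervals: this raised an error before 17.
SELECT timestamptz 'infinity' - now();          -- infinity
```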
So it's a simple addition that just plugs a bunch of holes where previously you had to handle it manually, with CASE WHEN x = 'infinity' THEN return this ELSE return that. If you're using partitioned tables, identity columns now work properly; this has to do with attaching and detaching partitions, where you could previously get into some weird positions with identity columns. You don't anymore. Postgres 17 will support what is officially known as temporal primary keys and unique constraints. This is again one of those things that doesn't technically add any new features to Postgres, but it is part of the SQL standard, and it's a way to deal with what the SQL standard calls temporal keys. It allows you to create a primary key that has to have, first, a column that is not temporal, and then you can say "valid WITHOUT OVERLAPS", where valid is a range type. In the SQL standard they talk about periods, which are basically a range type for dates or timestamps, and they don't support other kinds of ranges; in Postgres this will support all of our range types, but you still have to have one column in the key that is not a range. Under the hood this creates an exclusion constraint. Postgres has had exclusion constraints for ranges for a long time, but we couldn't use one as an official primary key, and we couldn't declare one as a unique constraint, because it would be an exclusion constraint. So this is a subset of exclusion constraints that can be declared this way, which is SQL-standard compatible and will therefore work when you're populating IDs and data types in applications from it. [Q:] Oh, I didn't do this part; somebody else will. We'll see, we're slowly moving towards it, piece by piece. So this only gives us the ability to use this as a primary or unique key. The SQL standard also has the ability to say "select this as of a certain time"; we don't have that syntax yet. Today, if you just query by ID, you will get all the rows back, so in reality you also need to put the WHERE clause on the validity range yourself. [Q: how much work is needed to fix that?] Progress is ongoing; there are patches out there for the actual code to do AS OF queries. They're not done yet, but 18 or 19, hopefully; they're actively being developed. [Q:] Yes, time travel was in Postgres and we took it out. The problem with that was that time travel in Postgres was not optional, and it was ridiculously slow. But yes, some of the things being done in these temporal parts of SQL are really solving the same problem that Postgres and Illustra did back in those days.
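Before moving on, a sketch of the temporal key syntax described above, using an invented booking table; the WITHOUT OVERLAPS clause is the new part, as the syntax stood at the time of this talk:

```sql
CREATE TABLE room_booking (
    room_id  int,
    validity tstzrange,
    PRIMARY KEY (room_id, validity WITHOUT OVERLAPS)  -- an exclusion constraint underneath
);

INSERT INTO room_booking VALUES (1, '[2024-03-01, 2024-03-05)');
INSERT INTO room_booking VALUES (1, '[2024-03-04, 2024-03-08)');  -- rejected: overlaps
-- Note: SELECT ... WHERE room_id = 1 still returns every validity row;
-- there is no AS OF syntax yet, so filter on the range yourself.
```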
Yes. Okay, moving on. There is now a new function in libpq called PQchangePassword. I bet most of you aren't really using libpq directly, but again, this becomes infrastructure that can be used by most other drivers that sit on top of it. This functionality used to be in psql, in the client; it's now moved into the driver, which will hopefully make it available to the other ones. What it does is let you stop putting ALTER USER ... PASSWORD or CREATE USER ... PASSWORD with the password in clear text, which then drops out into log files and things like that. This is the equivalent of the \password command in psql, which encrypts the password properly so that what goes into your log files is not a clear-text password. We'll have to wait a little to see this bubble up into the next layer of drivers, but at least they can now do it and avoid leaking passwords into the log files. Obviously, leaking passwords into the log files is only a small part of the problem; you still need to treat your Postgres log files as security-sensitive data, because a lot of other security-sensitive stuff can end up in them at some point, certainly at the application level. The jsonpath support in Postgres has gained a whole bunch of new operators, which are mostly about converting between data types within jsonpath, because JSON doesn't really have data types. You can convert to and from timestamps, you can convert to strings, you can turn a string into a boolean, and so on, to be used inside the jsonpath query language. So there is no really revolutionary jsonpath functionality; it's just chipping away at the small missing parts to get full compatibility and give you the full power of the jsonpath language. Let's talk about backup and replication. Everybody loves replication, everybody hates backups, I think that's the usual split. We've got a fair amount of good stuff in both of these that I think is quite exciting. First of all, I'll start with pg_dump, because it's neither backup nor replication, but it is dumps, and they're still useful. The new thing in pg_dump is that you can now specify a list of what you want to dump in a file instead of on a massively long command line. In this example I say include table foo, include table bar, include with a wildcard in a schema, but exclude the data for this particular table, making it a schema-only dump for that one. You could do all of this on the pg_dump command line before, but once you start adding more than a few tables it becomes very unmaintainable. So it's useful in that way, and you just add --filter.
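The filter file is plain text, one directive per line; something like this, roughly matching the example on the slide (the object names are invented):

```
# filter.txt
include table foo
include table bar
include table audit.*
exclude table_data audit.raw_events

# then:
#   pg_dump --filter=filter.txt mydb > mydb.dump
```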
Okay, I'm going to switch back to the other microphone. Here we go, still works, new batteries, we're back at it; it's like one of those Formula One pit stops, about the same thing. So as I said, pg_dump is a great tool, but not backups. What is a good backup tool? pg_basebackup is a pretty good backup tool, at least if you can live with the limitations it has. One of the limitations has been that it can only do full backups. Well, no more: pg_basebackup in 17 can do incremental base backups. [Comment from the audience] Yes, I agree wholeheartedly that our documentation should not say that pg_dump is a backup tool, but I will continue to fight that windmill; it's a hard fight, not everyone agrees, and they're wrong. So pg_basebackup can now do incremental backups. What does that mean? It will back up, compared to the previous backup, only those disk blocks that have changed, which hopefully means that your backups will be much smaller. That's the idea; if the entirety of your database changes between two backups there is no point in an incremental backup, you should just take a regular full one. Now, the way this works: if anyone here is using pgBackRest, you know you've been able to do incremental backups there for many years. They used to be at the granularity of a one-gigabyte file, and block-level incremental backups were added a couple of years back, but the pgBackRest way of doing it is to read the whole files and basically compare checksums to figure out what has changed. Postgres 17 instead adds a new process called the WAL summarizer, which you have to turn on with summarize_wal = on. It starts another background process that basically listens to the write-ahead log and picks out of it, oh, this block has changed, this block has changed, and generates summary files that pg_basebackup can then use to say, okay, from this backup to this one, here is the list of all the blocks that have changed, because you can get that information from the WAL; and then it backs up those specific blocks. The difference, you could say, is that when you run the actual incremental backup with pg_basebackup, it's faster and uses a lot less of your machine's resources, because it doesn't have to read all the data; instead you have this WAL summarizer that runs all the time and costs you a little bit of performance all the time, versus more cost at the one time. Which one is going to be better? I have no idea. We need a lot more real-world experience before we can figure that out, and I'm pretty sure it will depend on the scenario: in some scenarios the pgBackRest method will be faster, in some scenarios this one will. Then there's a parameter called wal_summary_keep_time, which tells the WAL summarizer how much data to keep around; the default is ten days. That means you can take incremental backups going back ten days. If you do a full backup once a day and incrementals in between, you're fine with ten days; but if you do a full backup every two weeks, don't leave this at ten days or you will have no usable backups. [Q:] Yes, pg_basebackup is always a binary backup, and that's one of the reasons you couldn't do this on a logical dump; it has to be on the binary backups.
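The server-side part of that is just two settings; the values shown here are the defaults as I recall them:

```sql
-- Start the WAL summarizer background process:
ALTER SYSTEM SET summarize_wal = on;
SELECT pg_reload_conf();

-- How far back incremental backups may reach (default is 10 days):
ALTER SYSTEM SET wal_summary_keep_time = '10d';
```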
The idea, of course, is that for your first backup you just run a regular pg_basebackup into a directory. When you run your second one, you run pg_basebackup with --incremental= and point it at the backup_manifest of your previous backup. If you point it at the full backup, it will back up all the differences since the full backup; you can also point it at your previous incremental backup, at which point you get a chain of incremental backups. You just tell it, okay, write this one over here, and it will read the manifest, get the WAL positions out of it, go talk to the WAL summarizer, and figure out which blocks have changed in between. Now, to restore: when you restore a regular pg_basebackup it's very easy, you just copy the files back and you're done. That obviously doesn't work with incrementals, so there is a new tool called pg_combinebackup. You say pg_combinebackup, into this directory, use the full backup and then the incremental backup, and what ends up in the combined directory is exactly what it would have looked like if the incremental backup had been a full one. If you have multiple incrementals you just add them as a chain: say you took an incremental every day and want to restore to Thursday, you start from the full, then Monday's incremental, then Tuesday's, and so on, and it will go through them all and generate one output, which you then restore as if it were a full backup. [Q: can you restore directly?] Yes, you can; this /backup/combined directory could be your /var/lib/pgsql/data if you have cleaned it out first, or if you're restoring to somewhere brand new. [Q: if a block is in both the full backup and the incrementals, will it be written more than once?] I don't know; I would guess it's probably written more than once. In the WAL summarizer it will get recorded multiple times, but the summarizer generates basically a bitmap, so it's only listed once there; I don't think the combine tool is smart enough to deduplicate, because it would potentially need a very large amount of memory to keep track of that, so I think it just writes it multiple times. I'm not certain. [Q:] Yes, the WAL summarizer is only needed to take the incremental backup, not to restore it. When you're running the backup command, if I point it at the full backup, then the summarizer needs to cover the time back to that backup; but if each incremental points at the previous one, then when this one is taken the summarizer only needs the data back to the previous backup, and the restore just follows the chain across. [Q:] To start a completely new chain, I believe you need a new full base backup. And there's no limit on how long your chain can be, only of course it will take longer to restore the more changes you have in there.
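Put together, the workflow described above looks roughly like this; the paths are invented:

```bash
# Day 0: regular full backup
pg_basebackup -D /backup/full

# Day 1: incremental, relative to the full backup's manifest
pg_basebackup -D /backup/mon --incremental=/backup/full/backup_manifest

# Day 2: chain off the previous incremental
pg_basebackup -D /backup/tue --incremental=/backup/mon/backup_manifest

# Restore: flatten the chain into one directory that looks like a full backup
pg_combinebackup /backup/full /backup/mon /backup/tue -o /backup/restored
```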
Yes? I think the question was also whether you can use the combined output as a new full backup. I don't think you can, but again, I'm not sure; it might be possible, but I don't think so at this point. Okay, so let's leave backups; pg_dump is not backups, although some people think it is, and replication is definitely not backups either, but what about upgrades? They're somewhere in between, and there are some interesting things there. Postgres 17 will be able to preserve subscriptions fully across upgrades. When you pg_upgrade today, all your logical replication subscriptions go boom and you have to start over. You will no longer have to do that; you'll be able to upgrade without rebuilding your subscribers, which is kind of neat. Of course, you will only be able to do that when upgrading from version 17, because it's required on the old side as well, but we like to prepare for the future. For now, you're still in the world where you need a new node running 17 and you logically replicate into it. Another classic issue with logical replication is when you're combining logical and physical replication: you have a high-availability cluster for something that also publishes logical replication, it fails over, and now everything breaks. To address that, Postgres 17 adds replication slot synchronization between nodes: between physical replicas, you can synchronize your logical replication slots. Today, the source of your logical publication is really an individual node; with this, conceptually, it can be your whole HA cluster. To do that you need multiple things. You need to enable the failover option on each replication slot, either when you're manually calling pg_create_logical_replication_slot, if you're using this for something that is not Postgres on the receiver side, or by giving CREATE SUBSCRIPTION the failover flag. Then you need to enable sync_replication_slots on all of your standbys, so that they will receive this data. And then you need to configure standby_slot_names on the primary, because Postgres needs to know it has to wait for those sync workers running on the standbys to receive the data before it's allowed to invalidate the data in the slot. Otherwise the problem would be that if your logical replication runs faster than your physical replication and you then fail over, you'd be broken again. Usually logical replication does not run faster than physical replication, but you could have a temporary hiccup on the network to one of your standbys or something. So you do need to set all of these, and what happens in practice is that when you turn on sync_replication_slots, an extra background worker starts that periodically synchronizes the slot state, so that when you do a failover your subscribers can just connect to the new primary and the data they need will still be there.
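As a sketch, the three pieces just listed map to roughly this configuration. The object names are placeholders, the parameter names are as they stood at the time of the talk, and as I understand it the standby also needs the usual slot-sync prerequisites (hot_standby_feedback, a dbname in primary_conninfo), so treat this as an outline rather than a complete recipe:

```sql
-- Subscriber side: mark the subscription's slot as a failover slot.
CREATE SUBSCRIPTION sub_app
    CONNECTION 'host=primary dbname=app'
    PUBLICATION pub_app
    WITH (failover = true);

-- On each physical standby of the publisher: keep logical slots synchronized.
ALTER SYSTEM SET sync_replication_slots = on;

-- On the publisher: don't let logical slots advance past these standbys.
ALTER SYSTEM SET standby_slot_names = 'standby1_slot';
```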
Okay, let's talk a little bit about performance. We all love performance. Postgres is almost always faster in the newer version; you will always be able to find something that's slower, but as a general rule it's faster. As usual there are a lot of infrastructure features that are really interesting to Postgres hackers, giving access to new kinds of structures for arranging tables internally in memory and things like that. We don't necessarily care about those as end users; to an end user, to us, it just looks like things are running faster. That's great, we love it. Singling out a couple of specific ones: COPY can be made significantly faster. One interesting case is copying data out: there's a function called uuid_out that's called every time a UUID is written out as a string, and in the specific case of a single table with just a UUID column, with everything in cache, copying it out is now about 60 percent faster. That's a non-trivial speedup, and obviously you won't have quite that ideal case in reality, but it will be much faster. There are also simple things: with COPY TO, Postgres would actually convert the encoding from itself to itself when the source and target encodings were the same, because it didn't realize they weren't different. It does now, so COPY TO in those encoding cases will also simply be faster. One of the more advanced things is something called self-join removal, which basically means that if you run a query where you join a table to itself on, say, the primary key in the simplest case, and you just query one column from each side of that join, the join is completely unnecessary: you can just read both columns from one scan of the table without joining it to itself. Postgres will now detect a lot of cases where this happens, remove the join, and get the columns from a single scan of the table. You can look at this and say, well, the answer for a long time has been "just fix your query, don't do that stupid join", but a lot of us are stuck with automatically generated queries, ORM queries or reporting-system queries, and now Postgres will just detect this, remove the join, and run faster. Another thing that surprised even me when going through the commit messages, and I found what I think is the best part of this feature in like the fourth paragraph of the commit message, as a sort of side note: Postgres will now deal with redundant NOT NULL clauses much more efficiently. Setting the stage: I create a table with a single column declared NOT NULL and insert a thousand values. If I run EXPLAIN SELECT * FROM foo WHERE a IS NOT NULL in Postgres 16, that's a bit silly; we know a is not null, it's declared. Here's how Postgres 16 runs that query (and I realize this is different test data; that was a thousand rows and this was run with 10,000, but it works the same way): with a filter on a IS NOT NULL. Now, if I run the same thing in Postgres 17, the filter is gone; we no longer need it, we just scan the table knowing every row will match. But that's not actually the part I find cool. The headline feature is this: what if we run the query as SELECT * FROM foo WHERE a IS NULL? There is a NOT NULL constraint on this table. How does Postgres 16 run it? A sequential scan across the table with a filter of "a IS NULL". How does Postgres 17 run this query? With a one-time filter of false: it doesn't touch the table at all, it just returns an empty result.
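To tie the two planner examples together, here is roughly what the slides showed; the table name is mine:

```sql
CREATE TABLE foo (a int NOT NULL);
INSERT INTO foo SELECT generate_series(1, 10000);
ANALYZE foo;

-- v16: Seq Scan with "Filter: (a IS NOT NULL)"; v17: plain Seq Scan, filter gone.
EXPLAIN SELECT * FROM foo WHERE a IS NOT NULL;

-- v16: Seq Scan that reads every row with "Filter: (a IS NULL)";
-- v17: Result node with "One-Time Filter: false", no table access at all.
EXPLAIN SELECT * FROM foo WHERE a IS NULL;
```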
Again, auto-generated queries often don't realize this; in fairness, my queries might very well include things like this every now and then too, but I've seen it many times coming out of ORM systems that literally do an IS NULL query where we know that can't happen, and now Postgres will just detect it and not run it. I actually think this is the more important part of the two, but they're both good, obviously: avoiding any work we can avoid based on the fact that we have declarative constraints. And this is yet another reason to remember to put those pesky NOT NULL constraints on your columns: not only because once you get NULLs in there everything bugs out when you didn't expect it, it will also run faster. That's an easier sell. Most versions of Postgres include some enhancements to parallelism. There isn't too much around parallelism in 17, but CREATE INDEX can now be parallelized for BRIN indexes. Previously, creating a BRIN index would only use one CPU; now it works the same way as B-tree, using up to max_parallel_maintenance_workers workers to create BRIN indexes as well. The SLRU caches: does anyone know what SLRU caches are in Postgres? Okay, a couple of people, pretty much all of them Postgres hackers, and you know what they are. Do you also know what the name means? That part is actually much simpler: it means "simple least recently used" cache. They're separate, special-purpose caches. The thing is, there's been essentially one of each, and now they're divided up into separate cache banks with separate locking, and you can configure each of these SLRU caches to a different size if you need to. These are things like the multixact caches and the transaction ID caches, very low-level caches that can now be configured individually. We already had pg_stat_slru, which tells you things like the cache hit ratio across the different parts, but you couldn't actually control them before; in 17 you will be able to override the defaults for these cache sizes and control them specifically. Obviously, most of you are probably never going to have to do that, but particularly on really heavily loaded systems with high concurrency, increasing some of these caches can make a substantial difference. So that's the list of things I was planning to go through. There's obviously always more, and specifically, since we're only about halfway through the final commitfest, there will be more things. There are more things already now that I just decided not to include, because again, what was it, 1,600 commits? There are many, many smaller things, performance improvements all over the place, and there's no way to go through them all. But this is also the perfect time for you to help out. We haven't released it yet, we haven't even put the beta out, but you can already get both RPM/yum and apt packages of the snapshot versions. You can put them on your systems and run them, maybe not in production.
I mean, you can, I just wouldn't recommend it. But if you have testing systems, if you have automated tests, please throw Postgres 17 at them and see what happens, and let us know if something broke. Do you have performance tests for your applications? Run them against Postgres 17; hopefully everything will be faster, maybe it won't, and if you let us know that now rather than after the release, the likelihood of it getting fixed before the release increases drastically. So please do download these, help us run these tests, and let us know how it works out. [Q about installing the snapshot packages alongside existing ones:] Yes, I believe so; Devrim says yes for the RPMs, and I don't know whether that's true for the deb packages as well. Okay. It's difficult to compare performance straight up between them in that case, yes, that is true. You can compare plans, and you can certainly confirm that things work and that your data comes back the right way, but that is a very good point: from a performance perspective they're not compiled with exactly the same switches as the production builds, so you can't do a direct comparison. So yeah, please do grab those, help us out with the testing, and report what works and what doesn't. You are the ones who have the real-world applications; our tests are mostly done on synthetic workloads, and we need to see this with real workloads. Once you've done that, hopefully we will have a Postgres 17 ready for you after summer. With that, I thank you very much. I think we're all running out to get to lunch; I have about two more minutes if there are any questions, and if not, you can always catch me in the hallways later. Okay, then off to lunch it is. Thank you.

[Break between sessions; the next speaker sets up.]

All right, I guess we're going to get started. So I'm Joe Conway. I've been around the Postgres community since the late 1990s, and I've been a committer since about 2003. If you've ever used set-returning functions, that was a feature
I added long, long ago. I wrote dblink and a bunch of other stuff; most of the big things I did were years ago. These days I mostly manage people, so I do less coding, but I still fix bugs here and there and do a feature here and there. I am also on the sysadmin team for the project, the security team for the project, and the team that recognizes contributors, among other things. And as I mentioned, I'm a manager at Amazon; I run the team that does the upstream Postgres contributions, so our team is all about contributing to the project. This is where you can get hold of me; I'll have that on the last slide as well, and I'll make the slides available somewhere, I'm not sure how SCaLE does it. Okay, so we're here to talk about the problem with glibc collations. How many people are aware of the fact that there's a problem? Oh, I expected all of you, since you're here. I'm going to show you some examples that make it obvious that this is a problem; I think it's a bigger problem than most people recognize. I'm going to show you what you would have to do to fix it if you run into it, and at the end of the talk I'm going to talk about an alternative approach that I developed. It's been in use now for well over a year; I can't say exactly how many instances it's running on, but it's quite a few. Okay, so we'll start out on a RHEL 7 system. Is everyone here familiar with the basic syntax of setting up a Postgres database: initdb, then pg_ctl start, and then connect with psql? That's all I did, just build a new database on a RHEL 7 machine with the defaults. This query here shows us information about the encoding the database is using; the encoding and the collation, I should say. You can see that the collation version it's using on a RHEL 7 machine is 2.17, and that's because on RHEL 7, glibc is 2.17-326, at least the last time I checked; it might be something newer right now. If you do the same thing on a RHEL 9 machine, you'll see that you've got glibc 2.34. You say, oh, no problem, that's just a minor version change, right? Well, no. The way glibc does its versioning, each of 2.17, 2.18, 2.19 is a major version change, and typically, for a given distribution, they'll freeze that major version; that's why RHEL 7.9 still has 2.17. The -326 means that 326 builds of that package have been released. I'm going to get into later why that's important, but suffice it to say, as you change your OS version you are going to get a new glibc major version.
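The query being described is roughly this one, comparing what the database was built with against what the OS currently provides:

```sql
SELECT datname,
       datcollate,
       datcollversion,                                            -- recorded at creation time
       pg_database_collation_actual_version(oid) AS os_version    -- what glibc reports right now
  FROM pg_database
 WHERE datname = current_database();
```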
All right, so this is what I call the in-your-face slide, for anyone who wasn't aware there's a problem, or anyone who thought, well, maybe there's a problem but I probably won't be affected because I don't do anything fancy. Well, you can't get much simpler than this. We've got three values: '1-a', '1a' and '1-aa'. On a RHEL 7 machine, if you sort those, you're going to get this order: 1a, 1-a, 1-aa. On a RHEL 9 machine you're going to get this order: 1-a, 1a, 1-aa. So if you thought this was an esoteric problem that you probably won't run into because you just do simple stuff, you may not be simple enough. Maybe you are, but maybe not: you can be using just the ASCII characters and still run into this. And remember, I basically built the database with all the defaults; when I did initdb I got en_US.UTF-8, because that's what my system is and that's what Postgres will pick up. All right, we're going to drill in a little bit deeper now. Is there any question about that before I go on? Okay. Hopefully I've got you beginning to get scared. So we're going to start out by creating a table on RHEL 7. It's a one-column table with a text column, and I'm going to make it a primary key; that's important for the purposes of this demonstration. I insert those same three values, then I do a SELECT with ORDER BY from that table, and I get that same order we just saw; no big surprise there. So now I take that machine and upgrade it to RHEL 9 underneath my database. This could happen in a number of ways, and as the person responsible for the database you may not even have a choice; someone else may decide that for you. One of the things you should notice, though, is that when I first start up the database and immediately run that same SELECT with ORDER BY, I get the same order. So you look at that and say, well, what's wrong? This is working perfectly, there's nothing wrong with my database. So the next thing you do is insert another row, except remember I have a primary key on this column, and I'm inserting another row with the same value that's already there; that should be an error. But guess what: it succeeds. So now not only am I getting sorting in the wrong order, I have a corrupted table, and if I try to REINDEX the table to fix the problem, that fails, because now I'm trying to rebuild a unique index, my primary key, with a duplicate in it. So this is just the very simplest scenario, and already you can see how it can bite you. Now I'm going to go back to the query we used earlier. You can see this part here, pg_database_collation_actual_version: that function basically uses the libc call to ask "what version am I?" and shows you that, so you can see that we were built on 2.17 and what's on the system now is 2.34, which kind of explains why we're in this hot water. All right, so how do I get out of this mess? The first thing I'm going to do is actually drop my primary key. Then, how many people are familiar with the ctid column? Just a few. ctid is a system column; it's not normally shown if you do SELECT * from a table, but it's basically a physical row identifier, and I'm going to use it to disambiguate these two rows so I can pick one to delete. Had this been a more complicated table with more columns, and one of the columns was different across those two row versions, you'd have to figure out which one to keep, so it's still not clean; you're going to have a problem.
So the lesson here is One of the things you got to be careful about is not only these G-Libc versions, but if you're gonna rebuild all your indexes, which is what we're gonna do to get ourselves out of this You really have to do that before you let anyone into your database after the OS upgrade You cannot let anyone insert update or delete Before you rebuild those indexes or you're gonna wind up with a problem or Shouldn't say you will but I should say You very well might so you can see here. I deleted the second one just because C-tid Larger indicates it's probably the newer one that it was inserted in error Although if you're familiar with the term wraparound in Postgres If Postgres wraps around The transactions wrap around if you've heard of that It's possible that the the newer c-tid could actually be the older row So you can't necessarily count on that either, but in any case you'll have to figure out one of them and get rid of it if you want to rebuild that You want to rebuild that at re add the primary key, which will rebuild that unique index and Then if you do this alter database refresh coalition version What that will do is it'll go update the system catalog to say all right now we recognize that We're running with the newer version of g-lobsy It actually that doesn't fix anything for you. It just like Updates the catalog to say that's what I'm doing. So in fact, you could run this and still have corrupt indexes And it would it would look like you know, you're using the newer version of g-lobsy, but you wouldn't actually have fixed anything yet So now once I've done all that if I run my order by select order by again now I'm finally seeing the order that I would have expected on rail nine for those three values and if I go back to my Query to look at the system table. I can see that my That call version matches whatever the actual on the system now that I've done the refresh Okay, so now we're gonna take it up a notch This is This is something I call the coalition torture test Jeremy Schneider who's back here Created a github repo Which produces So let's see I've got this right here. Oops Sorry again This repository that he created basically generates Roughly 26 million Strings that are in the like One to five or six character length, but they're all like these crazy Unicode characters and He built that as I understand it. He Jeremy's got a lot of experience dealing with Customers who have done strange things So he's he's seen a lot and so he's got some good intuition as to what sorts of things might cause problems with these Coalations changing from version to version of glibc and so he kind of built this test There's no way to exhaustively test every possibility but this tests a lot of stuff right and This test is not a database test. 
Okay, so now we're going to take it up a notch. This is something I call the collation torture test. Jeremy Schneider, who's back here, created a GitHub repository which, let's see, I've got this right here, sorry, basically generates roughly 26 million strings in the one-to-five-or-six-character range, but they're all these crazy Unicode characters. He built that, as I understand it, because Jeremy has a lot of experience dealing with customers who have done strange things; he's seen a lot, so he has good intuition about what sorts of things might cause problems when collations change from one glibc version to the next, and he built this test around that. There's no way to exhaustively test every possibility, but this tests a lot. This test is not a database test; it just produces a file of strings, and he ran it across different OSes; you can look through his README, which has some interesting data in and of itself. What I did was take the data from that and use a little Python magic to properly double up the double quotes in those strings and then put double quotes around them, so it was easy to pull into a Postgres table. So, getting back to the slides: I create this unsorted table, just one column again, and I bulk COPY in that formatted Unicode text, which is his output with my quoting. Then I do a VACUUM FREEZE; whenever you bulk load you really should vacuum freeze afterwards, for a lot of good reasons, but that's unrelated to this talk. Now, this query here, which I'll use repeatedly, basically does an ORDER BY over all the strings in that table, then a string_agg with no delimiter, so it jams them all together into one long string, and then calculates an MD5 over it. You can consider that a signature or a fingerprint for our sort: if that value is the same, we know the sorting is the same over these 26 million crazy strings. Keep in mind as we go through the talk that this value starting 7130 is the signature we're looking for. I also want you to notice the timing: this was about three minutes to do this sort. In a few slides you're going to see why I wanted to point that out, and I'll talk about it in a bit more depth later. There are links here to Jeremy's GitHub, and the formatted text that I pulled into my table is available online as well if you want to play with this. Okay, so now I'm going to build an indexed version of that table: basically an INSERT INTO a new table from the unsorted table, and then I create an index on it. You can see that when I run that same select again I get the same 7130 signature, so it's the same order, and now it takes about three seconds: really fast using the index, and by now everything was cached. Okay, there's an extension that comes with Postgres called amcheck. The functions in amcheck let you check the validity of an index at a very low level. I think some of those functions were written by Peter Geoghegan, the index-checking part of it at least, most of them; Peter Geoghegan is also on our team at Amazon and spends a lot of time with his head way down deep in the index code. When you run those checks against both the primary key on our little table and the index on the indexed table, they come back with nothing, which basically means there was no error; those indexes are currently good.
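The fingerprint query and the amcheck calls, sketched with assumed table, column and index names:

```sql
-- One MD5 "signature" for the sort order of all ~26 million strings:
SELECT md5(string_agg(t, '' ORDER BY t)) AS sort_signature
  FROM unsorted_table;

-- Low-level index validation from the amcheck extension:
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('t_pkey');             -- the little table's primary key
SELECT bt_index_check('indexed_table_idx');  -- the big table's index
```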
It may matter a lot. Now, the other thing you can see here, and this shouldn't be a surprise at this point: if I select from that indexed table, I'm still getting the same signature. But remember, we're on RHEL 9, so this is actually a corrupt index; it's not producing the correct result. If it were the correct result, it should match the new one, right? And it still takes about three seconds. And now when I use amcheck on the RHEL 9 machine, after the upgrade, you'll see I get errors. I get this "item order invariant violated", which basically means you've got a corrupt index. So both the little table and the big table have corrupt indexes now.

This time I was smart enough not to let anyone go in and do any updates or deletes to my big table, fortunately. So at this point all I really need to do is reindex. When I reindex, you'll see I now get the expected order for a RHEL 9 machine, and I no longer get an error from amcheck.

Okay, so that's kind of the end of that part, the scare tactics. Any questions about all that? Yeah, go ahead. This behavior is what you'll get if you started out on, say, a RHEL 7 machine with Postgres and then you upgraded the OS without rebuilding your indexes. If it's a clean install, then right out of the gate you're just going to get the RHEL 9 sorting semantics, so there won't be a problem. But eventually, when you upgrade RHEL 9 to RHEL 10 or whatever, you're going to have a problem. Well, yeah, and I'm going to get into this in a minute, but Devin's point is actually covered on the next slide, which is that any time you have a distributed cluster, you have to worry about the OSes not matching.

Go ahead, John. Oh, I see, Jonathan is feeding me a leading question. Yes, actually, that's a good point. There have been cases spotted in the wild where minor versions of glibc changing have caused unintentional changes in collation, so indexes have been broken by what should have been a minor version change of glibc. And by the way, it's also been spotted in the wild where a minor version of ICU broke collation. So it's not just glibc; ICU has had this problem. One of the things, and again I'll talk about this a little more with the alternate solution, one of the things you find out when you really start digging into this is that glibc uses some data files, and it uses some shared object libraries that it pulls in, in order to do all this locale magic. So it's dependent on data and code, and you can change collation unintentionally with either: if the data changes, you could change collation, and if the code changes, you can definitely have changes in collation occur.

Michael? So this was SUSE; SUSE actually did a major glibc version change in a service pack. Wow, I didn't know that one. Okay, that's good to know. Go ahead. Oh, so you've basically facilitated database loading in the GitHub repo; Jeremy has made his stuff more convenient to use if you want to test the database. Awesome.

All right, so this is a different example; this is the distributed cluster example. In this case I'm already on a RHEL 9 machine and I'm creating the postgres_fdw foreign data wrapper, and then I create a foreign server that points back at a RHEL 7 machine. And I'm still using that little three-row table.
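Spelled out, that first part of the setup looks roughly like this; the server name, host, and database are made up for the sketch, and the user mapping and foreign table come next:

-- On the RHEL 9 machine
CREATE EXTENSION postgres_fdw;

CREATE SERVER rhel7_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'rhel7-host.example.com', port '5432', dbname 'demo');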
And so I create a mapping for a user, and then I create a foreign table that points to that three-row table on the RHEL 7 machine, from the RHEL 9 machine. Is that all clear as mud?

Now, you'll notice one of the things I did here with these options: I wanted to force a certain plan in order to facilitate what I'm trying to show, but suffice it to say you could run into this in the wild, in real-life use cases. So once I've done that, I'm going to do a join of a local table on the RHEL 9 machine to my foreign table that's mapped to the RHEL 7 machine. I'm just trying to join on that column, and those three rows should be the same in both tables, so it should just work. But instead I get this error: merge join input data is out of order. And why is that? Because a merge join assumes that the data is sorted consistently on both sides of the join, and because one of them is sorted differently on RHEL 7 than on RHEL 9, it detects that and doesn't know what to do. Postgres doesn't know how to handle that, so it throws an error. And you can see here in the explain plan, this is one of the things Postgres will do: try to push down the ORDER BY if it can. So by forcing this plan where it pushed down the ORDER BY, I got the RHEL 7 machine's order, and I tried to finish the join on the RHEL 9 machine with its order, and that's why this didn't work.

Okay, so here's a third type of problem, and I'm not going to say by any means that these are the only three types of problems you might run into. I just couldn't get any more slides into this talk, and I thought this was enough to convince people they need to be worried about it. But this is another legitimate thing you might run into if you've got partitioned tables and you partition on a range. You can see here I've created a partition by range, and part one is from the minimum up to '1-a', and part two is from '1-a' up to the maximum. When you do this, it's basically inclusive on the lower bound and exclusive on the upper bound: the first partition takes everything up to but not including '1-a', and the second takes '1-a' and everything greater. And on the RHEL 7 machine, when I insert '1a', that value ends up in the first partition. But on a RHEL 9 machine, if I insert that same value, it goes to the second partition. So, with partitioned tables you really can't have a primary key that spans the partitions, right? So if you were depending on the partitioning scheme to maintain uniqueness, you just failed. And if you're looking for your '1a' value on your RHEL 9 machine, it may not be in the partition where you expect it. So again, this is just a third type of problem.

All right, so I'm just going to summarize everything I just talked about. Your collation was almost certainly provided by glibc if you're on Postgres 15 or earlier. In 16 there were some changes made that made it more reasonable to use ICU database-wide, but I don't think many people use ICU database-wide. So probably all of you are using glibc for all of your collations, and it most likely affects you. Sort order relies on collation. Indexes persist sort order. Constraints may depend on ordering.
I mean, that's another thing I didn't really show, but you could have a check constraint that depends on ordering. Partitioning depends on ordering. Merge joins and other operations may depend on ordering. Distributed clusters might depend on ordering being the same. And by the way, this is not necessarily just Postgres: there are lots of distributed systems out there, and to the extent that they depend on glibc or even ICU for their ordering, you could be getting incorrect results as things get merged together. I don't know how Hadoop works, but that type of use case comes to mind.

So here's the other half of why it's important. Actually, in my slide until this morning I thought RHEL 7 was end of life in 2026, and maybe a year ago, when I first built these slides, it was, and they may have pulled the date in. But it now very clearly says that the end of June 2024 is end of life for RHEL 7, and I'm willing to bet a lot of RHEL 7 machines are still out there. Debian also ended in June, and Ubuntu 14.04, which is one of their long-term support versions, is end of life in April. (Maybe my battery finally died; it wasn't doing that before.)

Okay, so the problems we need to tackle are broken indexes. Again, don't allow any DML to occur in your database before you rebuild the indexes, if you've done an OS upgrade; otherwise you'll have a bigger mess to solve. Distributed systems: you have to pay attention. Replicas can have problems. Foreign servers can have problems.

All right, so now I'm going to get into the third part of the talk, which is the solution that I worked on for this problem. Does anyone have questions about any of the preceding stuff before I move on? How am I doing time-wise, Devon? How much, 25 minutes? Oh, I've got plenty. Okay, good. All right, go ahead. 2.28 was the huge change? Well, I think Jeremy's conclusion has been that every major version of glibc has some changes. 2.28 was a huge change, so that one was the obvious one, but every major version potentially has some changes somewhere. No, there have been some where nothing changed. Right, yeah, it doesn't mean that just because you're past 2.28 you'll never have a problem. And to some extent some of this depends on what Unicode does, too. I think some of the problems people run into are when they have data in the database that's an unassigned code point, an unassigned Unicode code point. Some data gets in there through some user form or whatever; it's really not an assigned character, and then some later version of Unicode comes out and assigns it, and it needs to be put in a proper sort order, so that character is suddenly going to start sorting differently. I think I got that right. So you didn't see any changes between 8 and 9? Or... okay.

So, the solution I worked on. After lots of discussion we called it the compat collation library; one of the hardest problems in computer science is naming things, right? What it is, in a nutshell, is a method to build the locale functionality extracted out of glibc, by itself, standalone. And that's done in a way that makes it portable, at least in theory. I mean, in practice,
we've tested it across several major versions and it does work. I can't promise it'll work everywhere for everything, but for the things we've needed it for, it's worked great. It allows you to basically pin to a specific collation semantic based on a very specific glibc. We've tested this both for x86_64 and aarch64, so it works with ARM. And you can either link Postgres to this library and use those versions of the locale functions, or you can use LD_PRELOAD, whichever you prefer.

Now, this is available on GitHub. I was going to talk about this... I think I actually talk about it later. This is the location on GitHub, and one of the things to note here, because I've gotten this question a few times: if you come into this repository and just look at main, it looks like there's nothing there. It's an odd way to handle the repository, but after a lot of discussion about the details of how this stuff works, we decided it made sense. So I'm sorry if you don't like it, but this is the way it is. You'll see there are two branches here, 2.17-326 for RHEL 7 and 2.26-59 for Amazon Linux 2. So basically what we've done is create this based on very specific RPM package versions, one of which was from RHEL 7 and one of which was from AL2. And the reason these are so specific is because, as we talked about, even on a minor version change you can get subtle changes in behavior.

So how would you use this? You would clone the GitHub repository, check out one of those two branches, and then basically run the compat collation build and install the RPM. That RPM has no dependencies: you build it on a RHEL 7 machine, and you should be able to install it on your RHEL 9 machine. I've taken the binaries out of that RPM and installed them on my Linux Mint desktop, and they work just fine.

Okay, so again, the way this is built: if you go back and look at this repository, you'll see that it's built on top of specific glibc sources. Basically, if you did an RPM source install, say for glibc 2.17, you would get a sources directory and a spec directory. And in the sources directory, you see this? I'm just going to scroll this really quick. See all these? Those are all patches that get applied by the RPM maintainer. So whenever you install RPMs on a Linux system, the maintainer of those RPMs is applying patches to the upstream tarball, which may or may not change the behavior of collation.
Right. So, in the spirit of, I don't know if you've ever heard of Intel's "copy exact", but the idea here is we want copy exact. We want exactly the same behavior, you could call it bug compatible, whatever you want to call it, as we get from one very specific RPM version of glibc. So that's what this is built on top of: we basically do an RPM source install, apply our patches on top of it, and that's what's in this repository.

There are two distinct types of changes. There are a few small changes to the glibc source code itself, and then there are some somewhat larger changes that are required to actually get it to build. The goal was to minimize the changes, because we want to make sure it's easy to reason about them, especially the ones to glibc itself, to make sure we weren't going to inadvertently change collation. And they fit into roughly four categories.

First, glibc has a lot of hard-coded assumptions about file paths. Like I said earlier, glibc's locale functionality depends on libraries that it loads as well as data files that it loads, and those sit in a very specific path on your system, typically something like /usr/lib/locale. Those paths are hard-coded at build time for glibc, and we want to have our own set in parallel with the system glibc, so we had to modify those paths so we could have a path where we control what's in it.

Second, glibc is very cross-linked, in that virtually everything you build on a Linux system is coming from libc, all the basic functionality. If you want to allocate memory, that's coming from libc. So these things are all tightly cross-linked, and what we wanted to do was extract just the stuff that was needed for locale. In order to make this portable, we wanted to be able to run on a system with a different version of libc, so we didn't want to pull out the functionality that wasn't specifically related to locale.

And then, kind of related to that, glibc exhaustively versions its symbols. I'm not sure how many people here are familiar with symbol versioning in libraries. Kind of, sort of? So if I've got a function call, say strcoll, in glibc under the covers it's really strcoll at GLIBC_2.whatever, and that indicates which version of glibc that particular version of the function comes from. When you link that into your program, that's what your program is going to look for. So we wanted to replace those with the versions from
There's three main files this build files dot text Basically in G-libc every Every function that you would call from libc like again mem copy or something has its own C file in G-libc tree And so this build files text basically shows you all of the functions that were extracted out of libc in order to build this collation library The libcompact collation map is the symbol map so for the all of the symbols that we wanted to remap to our own symbol names That control is controlled by that and then there's the build scripts themselves And then as I talked about if you did wanted to use this and you wanted to preload and not Not specifically build postgres against the compatibility library. You would just have to tweak this one setting in the build file Okay, so I just want to go through some lessons learned that I had as I was developing this thing As I talked about when you build G-libc There's all this functionality that we didn't want and Much of that functionality actually is is Intimately tied to another library in your system, which is called LD so LD is the dynamic loader So when you load a shared object library when you're when postgres starts and it loads all these shared objects for extensions and whatnot LD is what loads them LD has some data structures specifically this 10 minutes, okay RT LD.global and the art read-only version of that that keep track of things like about your CPU and other locks and stuff that it's holding and those structs as you can imagine change from one version of G-libc to another and so you didn't want anything left in the compatibility library that referenced one of these structs And at first it took a while to figure out where all those were and get them out of there But that was fixed pretty early on Again that that rel9 timing that I talked about there's actually a commit This is the commit in 2014. I don't know if you can read that It says improve performance by removing a cache Now To the author's credit he goes through a lot of he has some data here that apparently tested something and Proved to himself that this was faster in most cases The way people use G-libc. I I have I haven't really looked at it that closely But I have to assume he just he didn't use certainly didn't use Jeremy's 26 million string torture test Because this this commit is what took us from three minutes to 60 minutes for that sort so One of the benefits that you get out of this if you were to stick with the This library from 2.17 as you get the old faster version on the newer system But by the way, if you I ran the same test with ICU and they were about the same speed as the faster earlier version of G-libc it It changed it changed sort order. Yes That's an example of C code Only changes that affected sort order, okay, so next thing I ran into an issue with C type in it. 
This was a threads thing. It didn't get picked up initially, because Postgres is not multi-threaded, so I didn't run into the problem. But it turns out things like pgbench and pg_dump can run multi-threaded. What happens is, the ctype init is needed to initialize some structures that are used for the ctype calls, which are things like upper and lower and "what type of character is this", and those happen to be thread-local variables. When libc does this, it runs at startup, so when libc loads it runs this, but then it reruns it every time it starts a thread. Well, the problem was, since that's running in libc, not in the compatibility library, it wasn't re-initializing these variables in the threads, and that was causing problems. But that's been fixed as well.

And then there were some cases... psql was probably the best case. It was linked both to the compatibility library and also to libpq, and the client-side libpq was not getting the compatibility library linked, because of a filter in the makefile. So actually, if you want to fix that, I have a patch, which is linked somewhere on one of my slides: a really simple patch to some makefiles in the Postgres tree which gets them to actually use the compat collation library for some of the tools. Because of that filter, setlocale was getting called in one place in the compatibility library and in another place in libc, and of course that didn't work very well.

And then the final problem I ran into is the difference when you build with optimization or not. When you build a binary on a Linux system with GCC, you typically say -O2, meaning optimization level 2, and that kicks in certain things like inlining functions. But if you build with -O0, which as a developer you often will do, because you don't want the optimizations since they make it harder to debug... I was building with -O0 and I wasn't seeing this problem; the real builds used -O2 and suddenly it was, well, why is this breaking? And that's because there were some things missing from the exported symbol map.

All right, so real quick. I think I have five minutes? Is that allowing for questions? I get until 3:30? All right, I've got ten minutes. Thank you. It's like daylight savings time, I just got more time.

All right. So I'm on a RHEL 9 system, and I install the 2.17 RPM for the compat collation library. Then when I configure... now, this is a source build of Postgres; I assume if you're going to get into this, you're at least comfortable with this sort of thing. When you run configure, you have to have LIBS and the include path on there, and the way GCC builds, if you put a library specifically on the link line, it's linked before libc, and therefore, if the symbol names are the same, your version gets used instead of the libc version. Oh, and this is where I referenced that patch earlier: if you really want to capture all of the tools, you have to fix those makefiles, and that's what this patch does. It's maybe something I should get upstreamed; I haven't really tried too hard, though.

All right, so you test it out. Remember, we're on a RHEL 9 system, but now we're seeing the same sorting that we would see if we were on a RHEL 7 system. And if we try to insert a duplicate value into that table, we get an error, which we'd expect; it doesn't go in there and corrupt the index. And now, if we run that same query against the system table again, the actual collation version is showing up.
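That catalog check is roughly this kind of query (Postgres 15 or later; the exact query on the slide may differ):

-- Compare what the catalog recorded against what the running libc actually reports
SELECT datname,
       datcollate,
       datcollversion,
       pg_database_collation_actual_version(oid) AS actual_version
FROM pg_database
WHERE datname = current_database();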
This is that change I was talking about in the glibc version function: it now shows the specific version we're running. And so when we do our string test again, now we're getting back to the signature we expect from a RHEL 7 machine, but it's on the RHEL 9 machine, and the timing is back to our three minutes. And of course our indexed version is also getting the right answer, and amcheck shows no problems with our indexes.

Okay, so just a couple more slides; this is really just if you're interested in this topic. If you wanted to play with this, this is one way you can examine your Postgres binary and see which symbols have been linked to the compatibility library versus which symbols were linked to glibc itself. If you see something like setlocale linked to libc, then something went wrong. And then this is sort of the same kind of thing: this script will go through your entire build tree and make sure that, for all the binary objects in the build tree, all the symbols that are in the map file are actually linked against the compatibility library and not libc. These are just some helper things I developed along the way to validate that everything was working.

So, in summary: glibc collations can be hazardous. I covered some ways you can fix that, or at least deal with it, and I showed an alternative approach. So I think I've got a few minutes left for any additional questions. Five minutes? Yeah, go ahead. So the question was, is Devrim going to include that in the RPMs, and Devrim gave an emphatic... tone. Jonathan? Right, so Jonathan's question is: what do we need to do upstream to make this better? Part of that answer... I actually meant to talk about this yesterday. Jeff Davis, who's also on our team at Amazon: while I was working on this, on how we deal with the legacy stuff, I asked Jeff, why don't you please look at fixing this going forward? And Jeff has actually gotten in the very first commit that provides a built-in collation provider. I think it's not complete yet; I understand from him that he's hoping to get everything in there. So 17 at least has an option for a built-in collation provider, which means this stuff is all in the hands of the Postgres developers and we're not depending on what's going on in the system. So that's part of what needs to happen. I think it doesn't solve all problems, but for a lot of use cases this is just perfect, because you're really depending on your indexes to do lookups fast and joins fast, and there are only some cases where you care about, you know, does '1a' sort before '1-a' or not. And in those cases you can actually put a COLLATE clause on your SQL and get it to use a different collation. And this will be a lot faster even than the 2.17 glibc, to boot, and it'll be faster than ICU, because it's basically just doing binary compares. So that's part of it. The other part of it,
I think, is something that's come up before: Thomas Munro posted a proof of concept, I guess a year or two ago now, for a way to load multiple collation libraries simultaneously into Postgres. And really, ultimately, we need that, so that when you go to the new OS... we still need Devrim to support old versions of ICU, right? So when you go to the new OS, you have to have the old ICU and the new ICU, and then your indexes will still work off the old ICU, but you have a migration path, online, to start building indexes concurrently that use the new ICU, so you can cut yourself over from the older to the newer in a controlled way. That's ultimately, I think, what we're going to need to get to. That's not going to happen in 17; maybe it'll happen in Postgres 18. We can cross our fingers.

Well, unless you build it with the preload option, so that it can only be used preloaded. If you do that, then basically they just have to install the RPM and set up LD_PRELOAD. I mean, the reason we didn't default to using preload is that it's a little bit harder to verify that you're actually getting the symbols pulled out of the right library when you're preloading. And if someone messes with your preload settings and all of a sudden you go from using the compatibility library to using libc, that would basically break your indexes. So I think preloading is a little bit more fragile, but it would be easier, because then you don't have to rebuild all the Postgres binaries; it would just work.

Michael, did you have another? I think we've got like one. Absolutely, it's an open source project, so if you wanted to... We needed the AL2 version; I did the RHEL 7 version because I figured that covers a lot of people. But if you had another version, I'll be happy to help you get through doing that. I originally developed this for AL2, and when I did the RHEL 7 version I think it took me a day and a half of hacking to get it to work. So it wasn't a moon launch, but it wasn't trivial either. Well, yeah, that's true: if it's Debian, I'm happy to work with you to try to make it work even on Debian; it's just that would be a heavier lift. All right, I think that's all the time we've got, so thank you very much. If you've got any other questions, I'll be around at the booth over the weekend. Come visit us.

Hello. Is this thing on? Test, test, test. Oh, wow. Let's see what we can do. Yeah, so I was in Brussels and I did this presentation, and, you know, super quick, you've got to get it out, and I thought I took it. I got home, and because I have a really nice clicker, like a Spotlight, I emailed them, like, oh my goodness, I really liked that clicker; I'd been searching for one like it for a month. Then I go to pack and I pull something out of my bag, after I'd been through the whole bag. I thought I'd forgotten it at my table, and it was in my bag the whole time that I thought I'd left it sitting on the table. What's that? I have no idea. So is this actually coming through out there at all? Yeah? Okay. Hello, hello. The same thing was happening; it works if I hold it really close, but they had adjusted it yesterday for some reason and it seems like... Who, uh, who logged me out? Actually, I'm going to type at the end,
so I want both hands free. I guess I could put... check, check, check... I could put that on the stand, I guess. "Transforming data with the power of Postgres." It's really quiet, right? Hello, hello. That's better, but it's still... I have a loud voice and that's not really loud. Oh well, it's okay. It's mostly for the recording; I wouldn't need this otherwise. Are we going to give it another minute or two? It's just cold in here; I'm sorry about that, I wish I could fix it. I've asked them like three times and nothing's changed. It's slightly warmer up front. All my fonts hopefully are big, but maybe you'll see better if you come up. How many folks are regular Postgres users? How many of you are regular Postgres users? All right. But not everyone. These speakers are really not loud like I am; I'm practically eating this thing and you can hear me just fine. Can you hear okay back there? Yeah, okay. Anyway, I'll do this. But that's great; a lot of people have done a lot of stuff. There are so many possibilities at this point, for sure.

All right, I'm going to go ahead and get started. It is 3:46. We'll definitely have time at the end for questions. The title of this talk has changed like three times as I've given it. I learned, I changed a little bit, and realized, okay, I've got to figure out a better way to identify exactly what we're trying to go through here. So for me, I'm going to talk about taking raw data and transforming it into something that hopefully is usable, whether it be for analytics or some kind of transformation which we may have done previously outside of the database. There's a lot of power in Postgres to help us do that internally.

So this is me, Ryan Booz. I currently work at Redgate. Redgate is a database tooling software company based in England; for 25 years they have primarily been in the SQL Server and Microsoft platform space. They acquired Flyway, we acquired Flyway, which a lot of you have probably heard of; it's a migration tool for doing database migrations as you upgrade and change your schema. We acquired it in 2019 and have been putting a lot of energy into it. There's still an open source version of it, and even that open source version supports upwards of sixty-some database connections, to do migrations to Snowflake and all the others. So, I am Ryan Booz just about everywhere: LinkedIn, wherever. You can come find me; I love to chat, don't be afraid to reach out. I'll leave this here for a second, and it's also at the end. I had this conversation on Twitter yesterday: you're welcome to take as many pictures as you want, but I just want you to know that this deck is already up there. You could download it right now and follow along if you want. It's a PDF and it looks like this. So this is the repo; I have a presentations folder and all my presentations are there. I try to keep everything updated; any time I do a talk a second or third time, it's always updated. So this will take you to that. It has a SQL script I'm going to show you at the end, and the actual PDF of the talk. Oh, sure. Yes, you can post a picture.
I totally get that, you're right. All right. And also, videos: if something like this is recorded and it's public, I just link it to a playlist for each year. So my playlists are here. If the talk is helpful and you want to go back and watch it, I'll make sure it gets linked here once it's public. So, YouTube, again, at Ryan Booz; you'll find the stuff there.

All right, so here's the talk. Five basic sections, and we'll do a little demo at the end. We're going to talk about ELT versus ETL... actually, ETL versus ELT... and then talk about how we can use various features of Postgres to set up the building blocks, and then go even deeper once you've learned the building blocks. What I've come to learn about myself, as I go to conferences and talk to people, is that sometimes it's easy to assume that everyone understands exactly how a recursive CTE works, or which function in Postgres allows you to split strings in a specific way to get values out, and this, that, and the other thing. Because it feels like, once you've done it for a year or two, oh, everyone must know this, and I've realized that that's not the case. So when I first did this talk, it was this monstrosity of all the fun puzzles I had done with Advent of Code and all the fun things I did to solve those puzzles, and after about 50 minutes I'd gotten through three puzzles. I was like, okay, I have to rethink this. And the reality is, I realized that most of the folks who were coming didn't fully understand some of the building blocks that would help them do these kinds of things in their environments. So that's what the majority of this is. Also, if you go to that GitHub repo you will find some of the puzzles I've solved over the last two years with Advent of Code using SQL; that was a challenge from the Postgres community. It was fun. I love solving puzzles, and I've learned a ton about how to better use Postgres and SQL.

So, Advent of Code. I really got into this a couple of years ago, trying to better understand how to take what seems like non-relational data and make it relational with Postgres. Back in 2021, I was trying to do a sample; I wanted to take those emoticons, or the emojis, sorry, those little squares, and turn them into data I could analyze. And sure, I could do it in a hundred other ways. I could do it in Python, I could do it in whatever language, this and that. But I thought, what if I could just take, in this case, the tweet, the text of the whatever-we-call-them-now, shove it into a database, and then with SQL alone transform it into a table of data that I could actually analyze? And in learning how to do that, I discovered new things about Postgres I had no idea about; I didn't understand what was going on, and that opened up the opportunities. So both of these were kind of the on-ramp for me, and that's why I'm going to use them as some examples today.

Most everything you'll see here is available in Postgres 9.6 forward. There are one or two functions that I will show you, or that are in the SQL, that are only in Postgres 14 plus. I'll point out the one I think is in here; there are a couple of others. So when we get to the samples, some of the SQL I'll show you in slides, and then when I do the demo I'm primarily using these two puzzles. You can go to Advent of Code and look at the content, the puzzles, what you're supposed to solve, for I think almost any of the years they've done it. This has been running for, I don't know,
This has been on for I don't know Does anyone know 12 13 years something like that? So I picked day 7 for both years. It was totally a coincidence I'd actually didn't realize that till I put the slide together so day 7 december 7th puzzle for 2023 is really simple and You know, it's a very simple kind of input and allows me to show a lot of this and then Last year's day 7. So basically the the 2023 puzzle is just a simple line It represents a hand of five playing cards and then a value And the puzzle was to split all these things out understand the cards And then you do some addition and multiplication based on a bunch of other things But you got to take that string and do something with it And so that it just allows me to show you something pretty easily And then the order of the cards in the hand left to right 1 2 3 4 5 is important to later solving the problem in the puzzle Last year's day 7 Is going to familiar to a lot of us and that's one of the reasons we use things like recursive cte's It's basically about moving up and down Through a folder system on a computer and so we can take the input the command cd and You know up down whatever and find Where all the files are within the folder structure and again the the puzzle was basically defined I think the folders that had the least amount of data To delete that folder because you needed to add a file You needed free upspace But again, it's moving up down through file system, which all of us have done at some point So let's talk about etl versus elt. So who has heard of etl Most of us have who has heard of elt a little Fewer of us So the whole point of etl Is to take Non-relational data And convert it into something that's relational so that we can use the power of sequel To analyze that data Now the data can be A text file The data can be a grouping of you know a text file and connecting to another system like snowflake and bringing that together Connecting to a parquet file with a text file with you know, whatever the sources are Could be many different sources, but you're trying to get that related turning into something You can get into a database and do analytics on The differences are really minor, but really You know powerful depending on which way you want to handle this data, right extract transform and load Is primarily external There are tools that have been built over the last 25 and 30 years to do this for you, right? They're specialized tools and they can do all kinds of transformations But they're special they cost money And it's outside of the thing you're ultimately going to use which is the database Extract load transform is just that small difference. We're still extracting We're getting the data from wherever it's at But then we're going to load it into the database first And then use the power of sequel postgres and the things we can do there to transform it So why would we do one over the other? Well, let's talk about etl briefly. Why has it been so popular? And it has been popular for a long time. This is the primary way That we take all this disparate data and turn it into something we can query It's been popular for a couple reasons. It's not reliant on a specific database, right? 
It's often an external tool that can pull all the things together, often in kind of a pipeline: you chain events together to process data in some form that eventually gets it into a table-like thing that you insert into a database, and then you move forward. The other, probably the primary, reason is that as the web grew, most of the databases we use, at least until the last couple of years, did not speak the language of the web very well. It used to be XML a long time ago, and then for the last 15-plus years we've been using JSON for everything, and most databases didn't have a way to store JSON efficiently and to query it well. So we had to have something else to transform that thing to get it into the database so we could use it. Well, that has changed, or is starting to change, pretty dramatically. Then, honestly, one of the other reasons it still exists and is still popular is that it developed an entirely new class of jobs. People have learned and become experts in these tools. The tools often cost a lot of money for the licenses. You might only have a team of two or three people, but they're really valuable, and it's hard to want to get rid of that special team if you're not sure you can do everything you need to otherwise.

One of the problems, at least in my experience, with ETL is that because it's external, iterating when new things happen can be really slow, because you're relying on an external tool and you're relying on people. Usually you have these different requests, and you have to decide, as a DBA or developer, okay, what does that mean for my schema? What do I need to change? If I add the columns I need and pass that request forward, are they going to understand it? Am I going to actually get the data I need? So there's a lot of iteration that's outside of the database, and it slows things down; you can't go as quickly as you'd like. Now, if you can keep that processing in the database, or at least closer to the database, you get a lot more control over those iterations. You have things like temporary tables; you can try something without actually impacting your schema yet, see if it's going to work, and move forward. You don't need to rely on someone else to do it. And that's really where I started to learn more about this concept of ELT.

Now, in my career, I'm 22 or 23 years into this whole tech thing. I did a lot of ETL a couple of jobs ago. We had thousands of client databases; we would pull stuff together to have metrics and to ask: is this client's data similar to the other clients within this group,
this region, what have you? But as I started to realize that I had all this information, and that with APIs I could grab it and start to do this in the database, the concept of ELT became really important to me, or really interesting at the very least.

So, a couple of benefits. Number one, it retains the whole transactional nature. One of the problems with an external tool is you don't have that control: it's going to process what it's going to process, but if it fails in some way... whereas if you were inserting that data into the database, you would know that the rows that got committed, got committed. You lose some of that, because it usually ends up being a batch process, and you might lose some of the nature of the data. And Postgres has added, over many releases, a plethora of functions that can do all kinds of transformations with a lot of data types: whether it's JSON, whether it's text, whether it's array data, there's a lot you can do. And honestly, for me, one of the things that was most helpful or interesting, with my developer experience coming to the foreground, is something as simple as arrays. When I understood years ago that Postgres had an array type, that was mind-blowing to me. And then to recognize that a lot of these functions end up using or exporting array types, and there's a lot you can do with that. So this all works together to say: I can get some data, and then with a few functions, all of a sudden it's a whole new thing that I can start to analyze.

All right, so ETL versus ELT: one is external, and you insert the final result. ELT is insert as much as you can, as raw as you can, and then transform it.

I want to talk about inserting data, because obviously you need that to get started if you're going to load this data. You have to worry about this with ETL as well, but when you're trying to get this raw stuff in as quickly as possible, you just need to know how to do it, so a couple of quick slides. A couple of things I would say: if you're going to try to work through something like ELT, you want to do your best to start with the simplest form of the data, the simplest schema that you're going to add it to. So if I have a text file, and maybe it's a CSV, maybe I don't know what columns I want yet, maybe I don't even want to split it apart yet because I don't know: you could just save the rows of the CSV to a table. JSON: just save the JSON document, pull out what you want later, but keep it simple. Don't get too complex initially, and then post-process all the stuff that's in there. So I just have a staging table with all the rows and chunks of data, and I can figure out later how to start to work with it. That saves you a lot of that up-front time of "what do I do with this thing first?" If I just have access to it, I can go ahead and start trying, testing, playing with it. If you're going to copy it in, just a couple of quick things: use COPY if it's something like a file.
It's probably the easiest thing, and it's the most important method for getting data in; I'm going to talk about it on the next slide. Typically CSV; there are other things you can do, you can name your own delimiters if you have custom formats, but it's a good way to get data in. And then I'll just say this: I worked at Timescale for a couple of years, and most of my history is about time-series data, and when you get lots of data coming in quickly, you learn that it's best to work in batches. So if you're going to ingest a lot of data with Postgres, and you've developed something in Python, Go, whatever it is, don't insert one row at a time. Batch up 10, batch up 50, and then do an insert, and you'll see the performance go up pretty dramatically just by doing that. Go to 500, go to a thousand, test it, see how fast you can get data into your database.

Now, two words about COPY, and I put this here because, again, if you haven't used COPY much, you'll walk away and go play with it, so I just want to get this stuff in front of you. COPY is a SQL command in Postgres, but it's a Postgres command; it is not in the SQL standard. It actually predates Postgres as we know it: it was in the older version of Postgres, and it's one of the things that came over. And it requires a file. If you use the SQL command COPY, it's looking locally at the file system of the Postgres server. Now, most of us in a production system don't have access to that. There are ways, if you use a hosted solution like Amazon, that you can use COPY to access something like S3 and so forth, whether in Redshift or in some other ways, but it's looking local, and most of us don't have that access. Honestly, even if you're using Docker locally, you probably aren't putting your file in the Docker container, right? It's just on your file system. So what you usually do is use the psql meta-command \copy; the backslash commands are what we call meta-commands. It's a meta-command called copy, and what it does is read the file off your local system and simply stream it, in the COPY format, into standard input on the server. So you're getting the same benefit, but you don't have to be local to the server.

Now, the only things you have to know about COPY: there are a couple of quick gotchas. Number one, it requires things to be in order. If you're doing a CSV, you can't just name various columns; you can exclude columns, but you have to know the order of the columns going in. The types have to match: if you're trying to insert into an integer column and it's text, it's not going to work. (I have no idea if this is behind me, but I kind of want to listen. Anyway.) And there's no data conversion, so if you have a date format and the string is different than what Postgres expects, it's just going to fail. And the problem is, any failure is an error: it stops, and any work that's been done is actually in the database, but it's not visible, so it's taking up space, and it won't go away until you vacuum sometime later. You could have inserted three million rows of data, and row three million and one fails, and you've lost three million rows of data. You've got to start over again.
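As a minimal sketch of that \copy approach into a dead-simple staging table (the file and table names here are made up):

-- One text column; figure out the real schema later
CREATE TABLE raw_lines (line text);

-- Run from psql: \copy reads the file on your machine and streams it to the server.
-- Column order and types still have to line up, and one bad row aborts the whole load.
\copy raw_lines (line) FROM 'input.txt'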
So that's one of the problems with COPY: the more data you insert in one go, the more you stand to lose. However, there are other tools. pgloader is one I refer people to if this is a process you're going to do regularly. It's a tool written a number of years ago, started by Dimitri Fontaine, and a number of people have contributed to it. It basically takes the file and allows you to specify various transformations if you want, so it can identify that the date format is wrong and transform it before it does the copy for that row, and if there's an error it can save that row off to the side and just keep going. So that's one tool; there are some others similar to it. Again, this is in that playlist I shared earlier, but this is a talk I've given a few times on four-plus ways to load data; there are a whole bunch more you can play with, but a lot of the same principles apply.

Any questions right now? Yes. That's correct: it's textual data with some kind of delimiter. It can't handle JSON, doesn't handle XML, things like that. Yep, it's not helpful for things like that. So that's where... great question. The question was, and I should have said this earlier, I apologize: can COPY handle something other than textual data, whether CSV or something with a delimiter? It's not set up to do that, so it won't handle copying JSON in. Now, if it's delimited correctly and there's a column in it containing JSON, there are ways to actually get that in and it will work, though sometimes it gets a little wonky and people have trouble. But just inserting it directly and trying to transform it in some way wouldn't work. That's where you'd probably use a programming language: because you're interfacing with an API and getting data out, well, then just use whatever language you're using to save it into the table, and then go about your business.

All right. So when you're doing something like ELT, what I've learned over four or five years of playing with this stuff, and helping some clients do things because they're trying to iterate quickly, is: just keep it simple. Back in my older days of ETL we'd spend a lot of time thinking about the schema we wanted so that we could transform this data, and it was a really long process. But if I'm going to do it in the database, I can just get the data in there. So keep the initial table simple. The only thing I will say is, when you're inserting data, and you're going to see me do this in a little bit, generate an id column for every row that gets inserted. You don't know when that might be helpful for you later; that way you have the order that things were brought in, which might not exist in the data itself. Just create an identity column; you'll be surprised, it might be helpful at some point. If it's time-series data, maybe add a timestamp of when the row was inserted, so you have a time order in some way; that might be something you need. Now, there's probably already a timestamp on the thing itself, a metric or reading or whatever, but it might be helpful for you to know it was read five minutes ago but inserted right now, in the rows that were inserted in these batches and so forth. Again, I've found that to be helpful when it's time-series-like data. And then, again, just preprocess what makes sense.
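Something along those lines, a minimal staging-table shape with the identity column and insert timestamp just described (column names are only illustrative):

CREATE TABLE staging_raw (
    id          bigint GENERATED ALWAYS AS IDENTITY,  -- load order, even if the data itself has none
    ingested_at timestamptz NOT NULL DEFAULT now(),   -- when the row arrived, versus when it was measured
    payload     text                                  -- or jsonb, if the source is JSON
);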
I'm going to show you an example in a second. Don't go overboard. Don't try to pull every single piece of information out and get it into a table, because in two weeks you're going to change that and you're going to go back, and I'll show you why in just a second. Do as little as possible initially, then get it into the database and do your work.

So this is an example you'll see a little bit later. This is that December 7th puzzle: it's just rows of text. I didn't do anything with the text externally; I simply said, I'm going to copy the data in, and notice I generated an id column. It wasn't part of the file, but that way I have that number in case I need to reference order in some way. Maybe I need to do a window function and look back some number of rows on some id. You have no idea when that's going to be helpful.

Now, Wordle. This is an example; this is a portion of the JSON from the Twitter API. A couple of years ago I was reading in a couple of million Wordle tweets over the span of a couple of weeks, and this was a chunk of what the JSON looked like. I didn't need all of this. Initially, I just wanted to make sure I was saving them; I didn't want anything to error, I just wanted to get the data in. So my initial table was just this: I was pulling out the timestamp and saving it, I was pulling out the tweet id so I could reference it later if needed, and then I just saved the raw JSON. And I just let that thing go. Then I knew I was getting the data, and I could work separately on transforming it. That became part of what I did as I transformed the data. Now, I kept a separate table of the raw JSON, and why I did that is, this wasn't my initial go-around. I thought, oh, I only need the author handle and the tweet text and something else, and then I would iterate a little bit more and say, oh, you know what, I do want to know the location, because maybe I could start to correlate tweets by where people are. And so I could change the schema, go back, link on the id of the tweet, and just update this data, because I still had access to it.

Now, if you're storing all that data a second time, yeah, that's going to cost; you're going to have to store it somewhere. In most cases, if you're in a hosted solution, you'd have some kind of policy: store it forever, archive it, or delete it. But in Postgres you could save it to a tablespace that's really cheap storage; if you're in the cloud, that table could be stored out to S3 in some way, or some kind of object storage. So it's really cheap, you have access to it if you need it; it might be slow to get back later, but you still have it. All right, does that make sense? Get the raw stuff in as quickly as you can. This table probably went through two or three iterations before I got to this, because I just wasn't sure what I would need from that other data.

I want to quickly show you this as just an idea. Everything I just talked about was taking your data and putting it into a simple table first. Now, what I have learned to do, particularly with the Advent of Code puzzles... has anyone done Advent of Code?
Any of those puzzles? Okay, a couple of you have. So the way they're set up is, each day a puzzle gets revealed, and there are two puzzles for one set of input data. It's always about Christmas, and usually about elves doing something, and it's always funny and complex and weird, but it is what it is. And I realized that in every puzzle you get a little chunk of sample data. The actual input is a larger file, but in the descriptions they give you 5, 10, 15 rows. And it was just easier for me to start with a derived table, often. I didn't want to deal with schema, I didn't want to create tables, I didn't want to download the file and import it yet. Because with the sample data, they tell you what the solution should be for that sample data, so I could work on my solution without importing data, creating tables, or anything.

So in this case it was five rows of playing cards: five playing cards, a space, and then the number, which is a bid, a value you do something with later. And I just copied those into this regexp_split_to_table, so I essentially got rows of data, and I could then start iterating through ideas of how I was going to process it. And then, when it worked, I could create the table, import the main input, and get my solution. So you'll see me do this sometimes, and honestly, think about it as an option: if you have a new puzzle, something you're trying to solve, grab a couple of rows of sample data, or sample text, or sample CSV lines, whatever it is, throw it into something like this, and you can get rows and see how you're going to transform it. I'll show it a little later if it doesn't quite make sense; you'll understand a little more in the demo.

Now, normally what I will do, because it's really hard... this is what we call a derived table. Well, this isn't quite a derived table, this is the output from a function; I would have to put this in a sub-select if I wasn't going to use a CTE. We're going to talk about CTEs in a minute. So normally what I do is grab my text and throw it in a CTE, so it looks like the table I'm probably going to create later. And the other reason I do it this way is, again, I don't have to iterate any schema. I'm simply changing this; there's no table action going on yet, it's just function output. If I end up with more columns, they're just there. I can iterate, and then I can create the table that I'm ultimately going to want for the final solution.

Now, I want to quickly point this out: who knows about dollar quoting? Who's used it in Postgres? All right. Super helpful. There are a couple of places you've probably used this if you have programmed within Postgres: when you write a function, you have to put dollar quotes around the body, because there's text inside, or when you write an anonymous block, something you're creating on the fly; it basically creates a block of code. In this case, dollar quoting is really about the text: this becomes a string literal, and I don't have to worry about escaping characters and escaping carriage returns. It's just helpful. I use it often in formatting: rather than using a single quote, putting the stuff in, and doing backslashes or whatever, I'll just put two dollar signs around the string, and then I have a literal string to start working with.
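Putting those pieces together, dollar quoting plus regexp_split_to_table inside a CTE, looks something like this. The sample lines follow the 2023 day-7 style of input, but the exact rows are just for illustration:

WITH input(line) AS (
    -- Dollar quotes let us paste the raw sample text, newlines and all,
    -- without escaping anything.
    SELECT regexp_split_to_table($$32T3K 765
T55J5 684
KK677 28$$, '\n')
)
SELECT * FROM input;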
So that's what I'm doing here: I'm taking the literal five lines out of the text file, I don't have to worry about the carriage returns or anything else, and then I'm feeding it into a function in some way. Just a helpful hint. Hopefully you've already caught one or two things you didn't know about.

All right, so let's run through the basic building blocks. I have data, I've figured out some way to quickly get at it and iterate, and I have lines, table rows, coming out somehow. Now I'm going to take that data and do something with it. There are a couple of constructs you may already be using without fully understanding what's happening, so that's why I'm going to walk through them.

The first is cross joining. We've probably all heard of it: a cross join takes two or more tables and, for every row in the table on the left of the join, iterates every row in the table on the right. The output is the product of both tables; if there are 10 rows in one table and 10 rows in another, you're going to get 100 rows. We can demonstrate this really easily with generate_series: generate a series from one to two, cross join it with another series from one to four, and you'll notice I get eight rows. One iterated through each of the other rows, then two iterated through each of the other rows.

Why do I care about cross joins? Because we get to cross join lateral, and this is tremendously helpful for taking data and iterating and processing it inside a SQL query. Here's what it means: when the right-hand item of a cross join is a set-returning function (I gave you an example earlier, regexp_split_to_table), it is implicitly what we call a lateral join. That means the thing on the right can reference any output from the table or function on its left, and all of that output gets fed back up to the SELECT, where it can be referenced. This becomes kind of cool, which hopefully you'll see in a minute. Functions can reference anything on the left, and then, I call it the reach-back, later in the query I can pick and choose from any of those outputs, however they were processed, to build my final result.

One quick helper for all of my demos: I don't actually write CROSS JOIN LATERAL, it's just a lot of typing. The comma means cross join, and because in this case the right-hand item, the second generate_series, is a set-returning function, that comma implicitly means cross join lateral. That's just how Postgres works; it's in the documentation. I don't have to spell out CROSS JOIN LATERAL; it's implicit when there's a set-returning function on the right.
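The generate_series demonstration, roughly as described; both spellings below return the same eight rows:

```sql
-- Two set-returning functions cross joined: 2 rows x 4 rows = 8 rows out.
SELECT a, b
FROM generate_series(1, 2) AS a
CROSS JOIN generate_series(1, 4) AS b;

-- The comma spelling is equivalent, and because the right-hand item is a
-- set-returning function the join is implicitly CROSS JOIN LATERAL:
SELECT a, b
FROM generate_series(1, 2) AS a,
     generate_series(1, 4) AS b;
```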
Here's a way to picture it: I have a SELECT statement, I'm selecting from table t, then I have a comma, and I'm feeding in some kind of function, in this case string_to_table, referencing table t's column a. I'm basically saying split on nothing, and it's going to iterate the rows; you'll see examples of this in a minute. Then I add another comma, and because the second thing is also a set-returning function, it can refer to any of the output above it, or to its left, as it were. All of that gets piped back up to the top, and notice in my SELECT statement I can reference all of those things as tables, plus any columns that come out of those functions. It's a lot of back and forth. Why would you do it? You'll see a couple of examples later; it's a really fun and really useful way to process complex data as you transform it, and it can simplify the SQL higher up. If you go look at some of the Advent of Code solutions, there are fun tricks too, like using VALUES to re-form some of the data along the way.

Then I threw this in the middle: now that we have a function in the FROM clause, something else becomes really valuable. Any set-returning function in the FROM clause (so it's returning a set of rows, like a table) can take Postgres's WITH ORDINALITY. That means not only do I get the output of the function, it also adds an ordinal value to every single row as a new column. I don't have to do ORDER BY, row_number(), anything like that higher in the query. It's much faster, and it retains the order of the rows coming out of the function, so you don't have to do something later just to get a row number.

What that means is you can take something like that CTE with the function inside, where I add a row_number() to the output of regexp_split_to_table and it gets me those five numbered lines. It works. But in a much larger file, the biggest problem with row_number() as a window function is that it must scan the entire data set to add the number: the ordering has to be done first, and then it iterates the whole data set again to attach the row number, so you're adding overhead. Instead, I can take the output of that function and simply say WITH ORDINALITY: I get the output of the function, the lines, and then the ordinal, which you can call whatever you want (I alias it as id), and I get the exact same result. Check out WITH ORDINALITY; I'm going to show a couple of examples, and hopefully a little light bulb will go off about how it can be useful in what you're doing.
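Side by side, the two approaches might look roughly like this, using the same illustrative sample rows as before:

```sql
-- Window-function version: the whole set is numbered after the fact.
SELECT row_number() OVER () AS id, line
FROM regexp_split_to_table($$32T3K 765
T55J5 684
KK677 28$$, '\n') AS line;

-- WITH ORDINALITY version: the function output carries its own row number.
SELECT id, line
FROM regexp_split_to_table($$32T3K 765
T55J5 684
KK677 28$$, '\n') WITH ORDINALITY AS t(line, id);
```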
One of the other reasons I've found to use these functions and cross join laterals is really about what happens up top. In this SELECT statement I'm starting to transform the data: I need to get the cards, and I want each individual card out, so I'm going to move that work into the FROM clause. (Let me see, this is not my clicker.) So this is the lines column out of December 7, and this is splitting it apart: I'm getting the cards, and then I'm splitting the five values. The first row is 32T3K, and I'm splitting it out into 3, 2, T, 3, K as individual rows, because I need to know which hand a card was played in and the order of the cards in that hand.

It's still kind of messy, though, so I can actually take that split_part call and put it in the FROM clause as another item, so that the query up top just looks nicer (there's a small sketch of that pattern below). I see a lot of people do that; it can be valuable and helpful just to clean things up a little, and then you can iterate down below if the lines or the logic get more complex. I throw it out there because I've seen a lot of people say, you know what, I kind of like that I can hide a lot of things down there.

Here's an example I actually had to get some help on. This is from last year's Advent of Code, a really complex puzzle where you're moving the two ends of a rope around a grid and you have to track the position of every end, so there's a head and a tail with x and y coordinates. Notice that the SELECT query eventually became pretty simple: I had the ends of the rope, the head and the tail, and it looked really nice. I could hide all the calculations down below in some cross joins, transforming all of that data and coming out with a couple of aliases. That made it much easier to work on the logic below, and the query up top was a whole lot easier to deal with than doing all of that work in the SELECT list. Any questions on cross join lateral? Sweet.

All right, common table expressions. Most of us have used them, I'm sure, but I want to make sure we understand a couple of points so we can get to recursive CTEs. They're often called WITH queries, depending on the database you come from, because they start with the keyword WITH. Most of us use CTEs because they're more readable; that's the thing everyone talks about. You're naming a query: it's essentially a virtual table with a name you can reference throughout the rest of your query. Just note that prior to Postgres 12 they were what we call materialized: the query result was kept, so that any time you referenced that CTE it wasn't run again. In Postgres 12 and above the planner tries to inline the CTE, which can make the query plan better, but it also changed behavior for some people's queries, so you can add the MATERIALIZED keyword to get the old behavior back. You can chain multiple CTEs together, you're going to see that, and one thing some people don't know is that you can name the output columns right at the CTE, so it doesn't matter what the columns are called inside the query; if you want them to have specific names, you can set that at the top.
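For the clean-up-the-top idea, a minimal sketch of hiding split_part in the FROM clause, reusing the december7_dt name from the earlier sketch; the aliases are mine:

```sql
-- split_part() tucked into a LATERAL item so the SELECT list stays readable.
SELECT dt.id AS hand, p.cards, p.bid
FROM december7_dt AS dt,
     LATERAL (SELECT split_part(dt.line, ' ', 1)      AS cards,
                     split_part(dt.line, ' ', 2)::int AS bid) AS p;
```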
So here's what they look like: we have WITH, and I name this query cte1. Now the next query has access to any CTE above it; it can reference any table or any CTE defined earlier, so this second query currently has access to the first CTE. If I add a third CTE, it can reference either of the other two: one, both, neither, it doesn't matter. Eventually you have to have a final query that references at least one of those CTEs to get an output. Pretty simple, right?

One of the reasons we do this, looking back at that puzzle with the positions and card numbers a little further down the road, is that without CTEs you'd probably do two sub-selects and join them together, because you're iterating. That's just ugly and hard to read: the actual SELECT is way up top and you're doing all these aliased sub-queries below it. It's annoying. So we switch it around, make two CTEs, and notice how much simpler the query below becomes. This is why most of us do it: it makes the query a lot easier to reason about.

However, readability is the primary reason we cite; it doesn't mean CTEs are more performant. They can become less performant, particularly as they get more complex, because you're asking the planner to plan a bunch of separate pieces and then put them together, and it won't get the row estimations right every time for each CTE. That gets annoying. So it's not always going to be fast; honestly, a lot of the time it might be slower, but it's probably worth the hit for what you're trying to do.

All right, so that's CTEs. But then there are recursive CTEs. Who loves recursive CTEs? I love it when hands go up. Who has never used one? All right. They feel hard, but hopefully they won't seem that way in a second.

The SQL language is declarative, and it's basically batch-based, which means there's no way to iterate the query as it goes; you just get the output. Yes, it might look like iteration with something like a cross join, where for every row here we iterate over there, but it's not the same as looping over a data set. That's where recursive CTEs come in. Before recursive CTEs, SQL was not a Turing-complete language; with them we can express iterative calculations, which is the marker for Turing completeness, so when recursive CTEs were added SQL got classed as Turing-complete. Kind of cool.

And it really doesn't look like much; it's still a CTE. We just add the RECURSIVE keyword, which tells the planner that one or more of the CTEs in this chain is going to refer back to itself, and that becomes the looping mechanism. So in this case cte2 is referencing cte1 in its SELECT statement.
(I didn't want to take up too much space on the slide.) It's also referencing itself, the cte2 CTE, and that's what causes this CTE to iterate, to loop over the process you set up, with the output of each pass feeding into the rest of what's happening. You'll often hear these called hierarchical or recursive CTEs because that's where they're most useful: recursing through something like a folder structure, which we've all done, we've all seen. So you'll hear "hierarchy"; whether it's buildings, and I used to work with utility data, so we'd have buildings, meters, divisions, all these related things we needed to roll up into some kind of tree.

So here's how recursive CTEs work. There are two parts, and this is where I always got messed up until I started using them more. The first part is what we call the initial query. It's a static query, it only gets run once, and it's the setup for the rest of the recursive CTE. The output of this query also determines the data types of the columns for the whole output. In this case I have a simple files-on-disk table, assume it's a table in my database, with three columns: the name, the parent folder, and the size. I select the row where the parent folder is null; there's only one such row, so I get one row back, and its name is folder a.

That output becomes the input to the next iteration, which I can join to. So I select from the files_on_disk table and join it to the CTE itself (I called the CTE files), that is, to the output of the previous iteration, one or many rows. I'm joining the parent_folder column of the table to the name column of the previous iteration's output, and that's how we start to recurse. So the second iteration looks like this at this point, sorry, and then we UNION it all together; the UNION is a requirement for a recursive CTE. The first query never gets run again, and every iteration takes the previous output and runs it through the second query. Iteration one got me the first row; iteration two got me five rows, because all of them had folder a as their parent. I'll show you how this completes in the demo.

Now, a caution about recursion: it continues until it finishes, until there's no more output, or until you put a limit on it. If you don't put a limit on it and you don't know when it's going to end, you end up in an infinite loop, your CPU pegs, and everyone's unhappy. So make sure there's a natural end point, or add one arbitrarily: a WHERE clause you know will terminate it, or a LIMIT, whatever that might be.

And then my other caution. I have four or five talks because I love arrays, I love range types, and I find that every time I learn something fun and new (have you heard the saying, when all you have is a hammer, everything looks like a nail?) that's what these tools become for me. Oh, I love recursive CTEs now, so every puzzle I'm going to find a way to use one, and then it turns out to be the silliest thing and I just needed a window function. So be cautious once you get used to them.
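A minimal sketch of that folders example as described; the table name, columns, and the assumption that the root row has a null parent are taken from the slides as narrated:

```sql
-- files_on_disk(name, parent_folder, size); the root folder has no parent.
WITH RECURSIVE files AS (
    -- initial query: runs once, finds the root folder
    SELECT name, parent_folder, size
    FROM files_on_disk
    WHERE parent_folder IS NULL
    UNION ALL
    -- recursive part: join each child to the previous iteration's output
    SELECT f.name, f.parent_folder, f.size
    FROM files_on_disk AS f
    JOIN files AS prev ON f.parent_folder = prev.name
)
SELECT * FROM files;
```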
You start thinking, this is really fun; don't use it for everything. All right, let's get to a demo. This is mostly for the recording, so I'm going to put this here so it at least gets captured, and show you a couple of these things. I just want to run through it top to bottom; we have 15 minutes, this will take about 10, and that leaves time for a few questions. I can increase the font just a little bit more.

So I'm going to create this table. This is puzzle one: five cards, a space, and a bid. I just want you to see it in action. Now, I'm cheating here; I'm not using COPY because I wanted to get it done quickly and didn't want to go out to a terminal. I'm just doing a multi-row VALUES insert. I get those five sample rows in, great, and this is all it looks like.

Now, I told you about dollar quoting. This just means it's a literal, so I'm selecting the text inside of it, whatever that string literal is. The other cool thing you might not know is that the two delimiters just have to match, so you can do something like this and it still works: you can put a word between the two dollar signs as a kind of tag to identify your beginning and ending. Kind of cool, right? I often use this in procedures and functions when I need a long format of some sort; if I'm generating a query inside a function, I write my query, put two dollar signs around the string, and then replace the various pieces. Mentally it works for me, and I'm just formatting the literal.

So this is what I was saying: I often start like this. I take a SELECT statement, run the text through something like regexp_split_to_table, and I get rows of data I can work with. To use that I either have to put it in a CTE or start doing a bunch of sub-selects, and that's annoying, no one wants to do that, so I just put it in a CTE. Everything below the CTE is where I start thinking about how I'm going to transform this data. This is just my sample data, it's a one-off, I don't have to create any schema, and if this were a real table I'd just say SELECT * FROM that table and it would look exactly the same. Sorry, I'm using DBeaver, if you know the tool, and I just hit Ctrl+Enter so I'm not hitting buttons; that executes the query I'm in. I forgot to say that.

Now, I showed you earlier that we could do a sub-select to get the row number, and that works, I get the row number, but I can do the same thing with WITH ORDINALITY and get exactly the same output. Lots of ways to deal with it. So I'm using that sample data, and then I can start to iterate below.
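Going back to the start of the demo for a second, the quick table load and the tagged dollar quote might look roughly like this; the rows and the tag are illustrative:

```sql
-- Quick-and-dirty load of the sample rows: a multi-row VALUES insert instead of COPY.
CREATE TABLE december7 (id int GENERATED ALWAYS AS IDENTITY, line text);

INSERT INTO december7 (line) VALUES
    ('32T3K 765'), ('T55J5 684'), ('KK677 28'), ('KTJJT 220'), ('QQQJA 483');

-- Dollar-quote delimiters just have to match, so you can label them with a tag:
SELECT $demo$It's a literal -- no escaping of quotes or newlines needed$demo$;
```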
So in this case we're just going to do the first part. I now have this fake table of data called december7_dt; I'm calling it a derived table, though it kind of isn't. I need to get the parts out. I need the id that I created: that's the hand, each hand of cards. Then I take the first part of the string and call it the cards, and the second part and call it the bid, and now I have three columns. I took that string, used a couple of functions, and split it apart. Cool.

Now, once I have the data loaded for real, which I do, the query is exactly the same thing, just selecting from the table itself. You can see I took the sample data, worked out the initial query, and then I could import the real data and run the same SELECT, and it just works, because my sample matches the real input. So from here on I have december7 and I'm not going to keep using the temporary stuff.

For the second part of the puzzle I don't just need the hand of cards; later on I actually need to know the order of the cards. So how do I turn a string into something tabular? I pivot it: I break those characters apart so I can identify each individual card within the hand. I'm going to have two IDs: the first is the id from my table, which identifies the hand among all of these plays, and the second is the card position, which comes from the split down here. I'm saying, of the string, split it into its two parts and take the first one, that's my five characters for the hand, and I'm adding WITH ORDINALITY because I'm splitting it into a table. Look what happens: first you see I have hands one through five down below. Take hand one, split it a second time, add ordinality to that, and I'm chaining these things together, which I can do because that cross-joined string split can reference the actual table data. (Oops, I didn't mean to highlight that, my apologies.) So now I have the first hand, I kept that id column, and I have five individual cards in that hand, numbered one to five. It's just building on top of itself. Kind of cool, right? And you can keep iterating through this.

Now, again, what I showed you earlier: I can clean up that top query a little by moving more of the set-returning functions into the FROM clause. Is that the most performant thing to do? No, probably not, but it might help you come to the solution and understand how these things work together. So I simply moved that split_part down below and now just reference it up top; hitting Ctrl+Enter again, the exact same data set comes out. Does that make sense?
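The pivot step might look roughly like this, using string_to_table with a NULL delimiter to split the hand into single characters; the table and alias names follow the earlier sketches rather than the exact demo code:

```sql
-- One row per card: the hand id from the table plus the card's position in the hand.
SELECT d.id AS hand, c.card, c.card_order
FROM december7 AS d,
     string_to_table(split_part(d.line, ' ', 1), NULL)
         WITH ORDINALITY AS c(card, card_order)
ORDER BY d.id, c.card_order;
```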
All right, let me show you one really fun thing quickly: the Wordle one. It's the exact same principle. We have guesses, and then I need to know each individual letter within each guess. Here's what the string looked like. Again, I'm taking an example and making a string literal out of it, and I'm using a different function here, regexp_matches. That gives me the data back as an array; notice the guess is an array, and I kept WITH ORDINALITY for the guess number; in this case this person used three guesses. Now I can take that output, put it in a CTE just to make it simpler, I called it wordle_score, and cross join it to another regexp_matches, so the output of the CTE feeds into the next function. There I say: take the first item of the array (there's only one item in it), look for every occurrence of one of these emoji, split it apart, and use WITH ORDINALITY again. So now I have each guess, each letter within the guess, and the order they appeared in. I'm turning emoji into tabular data. Then, to finish it off, I take that tabular data in my final select and go from emoji to, for each guess, how many letters were fully correct, how many were partially correct, and how many were wrong, and you can see that by the third guess this person had everything right. Do that across lots of tweets and you have really cool tabular data; with millions of Wordle tweets it's pretty fun.

All right, so all of this is out there. I try to do my best at putting in a lot of comments so you know what's happening if you take these examples, and I show you where to get the data. I just want to quickly go over recursive queries again so you can see them running, and then I'll have a minute or two for questions. Always feel free to reach out to me.

This is the simplest example: I set up the initial iteration by saying SELECT 1. That becomes the input to the next iteration, so the value is 1 + 1, which gets me 2. The next input says 2 + 1 is 3, and I keep doing that while the value is less than 10, because if I write less than or equal to 10 it will do one more iteration and give me 11, which I don't want. You can write the stop condition all kinds of ways. (I keep highlighting things, my apologies.) And we just counted from one to ten. That's cool.

Then there's the Fibonacci sequence, another one I like to show; we all learned it with agile, right? So, what is Fibonacci at position five? Five rows. My initial query starts with a zero and a one: the first value eventually becomes the current number. On each pass I bump the thing we call level by one and add the two values together, and after five iterations you get eight. What is it at ten? And again, this is where, if I didn't have that WHERE clause, you'd eventually just get an out-of-memory error because the number keeps growing. So I can ask, what is it at 50? It's really big; that's the Fibonacci sequence at the 50th position.
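A counting-and-Fibonacci sketch along the lines of what was on screen; the column names are mine, and the Fibonacci version here is indexed from F(1) = 1, so the exact values per step may differ from the slide:

```sql
-- Count from one to ten.
WITH RECURSIVE counter (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 10
)
SELECT * FROM counter;

-- Fibonacci: a holds F(step), so step 10 gives 55.
WITH RECURSIVE fib (step, a, b) AS (
    SELECT 1, 1::numeric, 1::numeric
    UNION ALL
    SELECT step + 1, b, a + b FROM fib WHERE step < 50
)
SELECT step, a AS fibonacci FROM fib ORDER BY step;
```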
And the files-on-disk example is the one I showed you in the slides. I have this table, it's already there, great, and that's what it looks like. (You're going to say something? Sure, I could do it that way, and I have, actually.) So this is what the table looks like, and the initial query looked like this: I got that first row out, folder a, and that's the input to the next iteration. We're joining name to parent_folder right here. If you remember, this is what the parent_folder column looks like: folder a, folder a, folder a, and so forth, and there are two folder b's in there. So the second output would be those five rows, and now I'm going to iterate over all five of them, name by name, and a couple more will match once I get to folder b. The final output goes all the way down through folder b.

Because I can connect each row to the previous iteration, I can do things like join in parts from the previous query, so pretty quickly I can build something that looks more like a path: I'm taking output from the previous iteration and concatenating it with something from the current one. That's where recursive CTEs get really powerful, and you can do a lot of fun things with them.

I'm not going to walk through this one, but here's what it looked like in that puzzle: a lot of shell commands, changing directories and listing file names, and you can iterate through all of those. I'll just show you one quick output because it is kind of cool. Yes, I know it's going to drop; yes, I'm going to create that; and here I'm going to insert the data. This is what it looks like, a bunch of rows, and then I can use something like regexp_matches. I'm using it twice: I find anything that's a cd command and anything that's a file, using the regex to identify those positions, and I go through it piece by piece and get those rows out. That continues on; if you pull the file down you can run the whole example, get to the final result, and see how it works.

We're almost at the end. Let me show you the other reason, the one I forgot to put in the slides; I'll go back and add it. One of the other reasons I love iterating these processes with CTEs, even when it isn't the most performant approach, is that at any level of the CTE chain I can drop in a SELECT statement and check my progress at that point. If something's failing, I can go back and ask what I did wrong. When I then build the next step of walking the file tree, I can put a SELECT there and see the output, and as I add the next CTE, if something doesn't work, I can comment pieces out, look at the intermediate output, and rethink it. It really lets me debug as I go.

That's basically the end. I'll simply show you this: there's all kinds of other stuff. Go look at some of the Advent of Code solutions; there are so many functions you can use once you start splitting this data apart and transforming it, and you can go crazy with it. And here are communities where you can get help; all of these people and many more helped me learn a lot of this over the last four, five, six years. So get into the community if you're not there already, and you're going to learn a ton.

That's it. We are out of time, but I'm happy to stick around for a couple of questions. There's the talk link again. I don't see Tatiana here yet, so let's take a question or two if you have one; if you don't, that's fine too. Going once, going twice. Have a great day, thanks for coming. I'll leave this up for a second if you just want to scan it.

Okay, not a lot of us today. Hi everyone, nice to see all of you. I think it's already five p.m.
And we are ready to start. I know it's the last talk of the day and you may be a bit tired already, so I'll try to entertain you as much as possible and not put you to sleep; if I do, just let me know and I'll speed up. So let's start. Today's talk is "Even children can work with Postgres."

First, a bit about myself. My name is Tatiana and I am the CEO of DBeaver. DBeaver is a UI tool for working with databases; it was created and born as an open-source tool in 2011, a lot of years ago now. Today we have about eight million users, and you can find them in every country of the world. What we build is a database management and administration tool where people can work with any kind of database.

Why am I talking about this? Because a very big part of our community is people who work with DBeaver and with Postgres: they write queries, they do administrative tasks, and so on. What is even more interesting is that these people are not necessarily technical. More than 35 percent of our users don't really know the SQL language; they think of a database as a collection of tables, and they live quite happily with that. The question is, how can we help them? Because no one really wants to write queries or scripts for someone else; we want to make these people independent, we want them to create things by themselves.

And that's where last year changed things, because OpenAI appeared. I think you've heard about OpenAI a lot: over the last year, at every conference there was a talk, or ten talks, or maybe all the talks were about AI and large language models, because it affects everything. People discovered they can do all kinds of things with AI: write articles, create images, write reviews, do their job, just by writing the right prompt.

So today we'll look at how good AI actually is at creating queries. The simplest thing we can do is try ChatGPT (everyone knows what ChatGPT is) and ask it to create a query. I did this experiment for you: I opened ChatGPT and wrote a request like "Please write a SQL statement to show all invoices, for Postgres." Literally, I didn't write "please," but anyway. As you can see, it helped me and even created two options: one is SELECT * FROM invoices, and the other one lists specific columns. A good result; most likely this kind of SELECT statement will work.

But when we think about how we actually want to work with AI, in most cases we don't want to phrase the request like that. We don't want to write "write a SQL statement to show all invoices"; we want something like "show all invoices," and it shows you invoices. Nice. So I did that; I typed "show all invoices." Unfortunately it's too small on the slide, because these days ChatGPT is very talkative and gives you a lot of commentary, but you can see the queries are essentially the same, and for the second variant it also added invoice_date and total_amount for us. Really nice. Our job is done: ChatGPT can create queries for you.
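Roughly what those generated statements looked like; the table and column names are whatever the model guessed, since there was no real schema behind the request:

```sql
-- Option 1: everything from the (guessed) invoices table.
SELECT * FROM invoices;

-- Option 2: a guessed column list, including invoice_date and total_amount.
SELECT invoice_id, customer_id, invoice_date, total_amount
FROM invoices;
```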
Thank you, everyone, my talk has ended. Okay, not so fast. If we open a fresh conversation with ChatGPT and send just "show all invoices," we get a lot of words again, but the conclusion is: no, I cannot show you anything, I don't understand what you're asking. So what's the difference? The difference is context, because for AI, context is the key.

What did we do the first time? We first said "write a SQL statement to show all invoices, for Postgres." From then on the AI knows we want SQL statements, not actual invoices it might imagine we have, and that we want them for Postgres, not for something else. ChatGPT used to show a notice on its capabilities screen saying that it remembers what the user said earlier in the conversation. That's how the context was created: it remembered our first request, so when we later asked it to "show all invoices" it produced the same SELECT statement.

So in principle we now know how to work with this; we can do it ourselves. The only thing is, if we want the right SELECT statement, a complicated one that joins several tables and so on, we need to provide the right context. And what is the context for a SQL query? A lot of stuff. It isn't just the database type, like Postgres or MySQL or SQL Server; it's also a lot of metadata: database names, schemas, tables, constraints, column names, and so on. The more complex the query, the more metadata we need, because nobody wants an abstract query. Most people want a query they can just run and get a result, and for that you need all of this metadata. Are you ready to type all of that into ChatGPT to get your query? I don't think so. It's impractical; it's easier to learn SQL and write the query yourself than to explain all of it to ChatGPT.

Okay, so now we're ready to end the talk: unfortunately ChatGPT cannot help with our mission. Again, it's not so simple; let's move forward. I took OpenAI as the example here, but this applies to almost all publicly available language models. We have OpenAI as the model in the center, and every model has different kinds of interfaces. For OpenAI it's Bing, ChatGPT, and the OpenAI API. Bing is a browser-based solution; ChatGPT is also a website where you can ask your questions; and the OpenAI API is the thing you can integrate into your own applications if you want. Compared with Bing and ChatGPT, the API is the only one that's paid, but with it you can achieve exactly what you need and help your users. In our case, in DBeaver, we use the OpenAI API to hide all of this context creation and sending, so users don't need to care about it at all; they care only about the query. And currently in DBeaver we have integration with three different AI engines.
They are OpenAI, Azure OpenAI (which is almost the same as OpenAI but running in the Azure cloud), and, brand new, Google Gemini. Looked at from a technical perspective they're a bit different, but they work very similarly, especially for users, and even the results they give are broadly alike. You can experiment with different ones, but the general process is the same for all of them.

So how do they work? If you want to ask something of an AI engine you first need to determine the context, then write the request, form the prompt, send it, and get the response to show the results. Just five simple steps that will give you something in the end. Or not. Why "or not"? Let's look at the steps a bit more closely, starting with the context.

The main thing to remember about context is that more information is better than less. The more information you send, the better the result you'll get. Not all of the data will be used for the final query, but it's often hard to predict what will be useful and what won't. Practically, the amount of context is limited by the number of tokens you can send to the model. What are tokens? Just chunks of symbols: as a rule of thumb, roughly 750 words come out to about 1,000 tokens. It depends on the words, of course; a short common word can be a single token, while a very long word of 30-plus letters will most likely become two or even three tokens.

This matters because different models give you very different token limits. The classic OpenAI GPT-3.5 and GPT-4 models give you on the order of 16,000 and 32,000 tokens. To give you a feel for those numbers: for a middle-sized Postgres database, not enterprise level, you can send the whole schema with all constraints in roughly 20,000 tokens. Which means that with, for example, Azure GPT-3.5 Turbo, you will most likely run out of tokens, and if you try to send more than the model accepts, it just says it's out of tokens and returns nothing. So you have to be careful with that. A year ago, when people were just starting to experiment with AI models, the out-of-tokens error was very common because the models were small and accepted few tokens. But look at the freshly announced Google Gemini: one million tokens. I can't even imagine the size of the database schema you could feed through that. (Yes, exactly.)

Now that we've determined our context, we can move on to the very nice part: writing the request. This is actually exactly what our user sees, because the user doesn't know anything about the context; he or she just writes a nice request. And the great thing about AI is that these requests can be very human-readable. You don't need to talk to it the way you'd talk to a robot (not that we talk to robots every day); you just talk to it the same way you'd talk to a friend, a colleague, or a teacher. You don't need to formulate it 100 percent correctly, and you don't need to be grammatically correct.
You can write it in any language, too, and that's very useful, especially if you work in an international company where one department is in, say, China and another in Mexico: they can write their requests in Chinese and get the same result you would. Really cool.

Moreover, you can help the AI create a query that will actually work, because, as we'll see further on, it doesn't always give a 100 percent correct answer. You can say "use this table," or "take the data from this source," or "use joins," if you know what joins are. If you've ever played with AI image generation you already have a feel for this, and it's very similar for SQL. If you ask for a picture as a pile of fragments, like "draw a beaver. The beaver is sitting. The beaver is sitting in a Mercedes. It's a sunset. High-resolution picture. Best result. The French coast," you'll get a very strange picture, because the model chops your input into parts and draws each part in some way. Don't do that. Write one complete sentence: "Please draw me a beaver sitting in a Mercedes at sunset on the French coast." You give the full picture in one description, and it's exactly the same with queries. "I want to combine the data about my customers and their purchases for the last year" is the kind of request the AI expects, not "give me all customers, join them with the accounts tables and their purchases and something else." Talk to the AI the same way you talk to people. That's it.

That's where the human work ends and the more automatic part begins: forming the prompt. Current AI services really offer two kinds of API, one for a single prompt and one for chat mode, and in DBeaver we have both, so I can show you the differences.

For a simple single prompt, we build the prompt from four parts. First, the context: all the database metadata we send. To mitigate the too-many-tokens risk I described, DBeaver will simply cut the metadata if we get too close to the limit, and we always suggest choosing the part of the database that should be used for creating the query. After that we send the human-readable request, and then we give the AI its task, in our case "create a SQL statement"; it's fairly obvious, what else would we ask? And because the single-query API works like "complete the sentence," we also have to give it the beginning of the sentence: in our case, we start it with the word SELECT.
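To make that concrete, a rough sketch of how such a completion-style prompt could be assembled and what might come back; this is an illustration of the structure only, using Pagila-style table names, not DBeaver's actual prompt format:

```sql
-- 1. Context: trimmed schema metadata for the relevant tables, e.g.
--      CREATE TABLE film      (film_id int, title text, release_year int, ...);
--      CREATE TABLE inventory (inventory_id int, film_id int, ...);
--      CREATE TABLE rental    (rental_id int, inventory_id int, rental_date timestamptz, ...);
-- 2. The human-readable request: "show all films rented on May 25"
-- 3. The task: "Create a SQL statement"
-- 4. The beginning of the completion, the word SELECT, which the model continues.
--    The completed statement might come back looking something like this:
SELECT f.title, r.rental_date
FROM rental AS r
JOIN inventory AS i ON i.inventory_id = r.inventory_id
JOIN film AS f ON f.film_id = i.film_id
WHERE r.rental_date::date = DATE '2005-05-25';
```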
So now our AI model has a prompt it can work with. Chat mode is a bit different. In chat mode we first send the same context that we do for a single query, but the next stage differs. The first time, we also just send the human-readable request; on every following turn, though, we send the whole conversation. Unfortunately we cannot trust the AI's memory: sometimes it remembers, sometimes it doesn't, you never know. That's also why, even for single queries, we always send the context again, even when we're asking about the same database again and again, because after the third or fourth attempt the AI can forget parts of it for no reason. So it's better to send everything. In chat mode we send the previous requests and the previous responses we got, and we even mark them: this was the previous request, this part was your response last time, and now here is the new request. That helps the AI understand what we want. Then we again give it the task, "create a SQL statement," but unlike the single-query mode we don't have to supply the first word. The user can just say "add a comment," because we're having a conversation and we expect comments back from the AI if we formulated something badly.

After all of that, we get a response. Sometimes it's good, sometimes it's not. Why might we get a bad response? There's a whole list of reasons. I really like this one: "I was wondering how I can get rid of this error." It's a screenshot from a DBeaver GitHub ticket from when we first implemented the feature. Sometimes people just don't understand what happened and why it didn't work.

So, the list of reasons on the slide. The first is that you do not have internet access. If you think that's a silly reason, it was actually the cause of the error in that screenshot, and not because the person was careless: he was in a closed corporate environment, access to the OpenAI API was simply blocked, so it was impossible to connect and impossible to get a result. It could be obvious, but it wasn't. The next reason is less common these days because the AI engines have become stronger, but it still happens: if you try to get results in the middle of a working day on a Wednesday, you may just have to wait. The next couple of things are your responsibility: make sure the API token you use is correct and that you have actually paid for access to the model. Some models are free for now; Google Gemini, for example, is currently free and you can use it as much as you want, though I don't know how long that will last. The OpenAI API is not free: it has a trial that's big enough to play with the technology, but not big enough to use regularly every day. And yes, you can also choose the wrong model. OpenAI offers very different models, and not all of them can return a SQL query; in that case the model just answers that it can't do anything. And, again, you can send too many tokens. We actually defeated this problem in DBeaver.
We do it by simply cutting the context, but if you build your own application that works with an AI API, remember that this matters.

So, let's move on. Even if everything goes well and you receive the right query, that doesn't guarantee anything. Sometimes, even if you sent the correct data, OpenAI uses some internal knowledge of its own and tries to create something beautiful. In this case, you can see that in the script the AI created we have the word "description" highlighted in red, because that column doesn't exist. Why did OpenAI decide it had to be there? No one knows. We sent the whole metadata, information about every column, everything, but apparently if you have a table with films, it must have a description. No other option.

And here is maybe the most philosophical point of this talk, so please concentrate. It's a very well-known quotation from ancient times. I don't list an author because I don't know who said it; if you believe the internet, every great person in history said something like it. And it is absolutely true when you work with AI: you need to ask the right question. You need to formulate it as well as you possibly can to get a good result, and then, ta-da, you get the result. It shows you the data and you don't actually have to do anything yourself. You just write the request, like "show me all films rented on May 25," and it returns the results.

So it looks like everything is good, but before we go further, a couple of important things, first about the unobvious limitations. One of them is related to the number of tokens: huge schemas with hundreds and thousands of tables (enterprises love that kind of database), and also companies that prefer not to design the schema themselves but generate it with third-party tools, where the table names are produced automatically and can be sentence-length, hundreds of characters. When both of those numbers are big, you will run out of tokens very quickly. The second point is about analytical databases: because of the way the data is structured, a classical analytical database can have tables with hundreds or thousands of columns, and it's the same story, out of tokens very quickly, just because your database or a single table is too big.

And of course, AI works really well with Postgres, because Postgres has existed almost forever, since about '95; there's a lot of documentation and a lot of accumulated knowledge about how to write queries for it, so the AI knows it very well. But if you work with a brand-new database, maybe even a fork of Postgres, most likely the AI knows nothing about it, and it can be hard, especially if that database has its own special language. A good example is Mongo: not a brand-new database, but it has a very specific scripting language for querying. A year ago I had a slide in this presentation saying that NoSQL databases are not for AI. Now the models are much better with NoSQL as well, because someone trained them on it. But Mongo is quite popular; if you use something very rare, I have my doubts it will be supported that quickly.
So no SQL databases not for AI Now they are much better with no SQL databases as well because someone trained them on that But, you know, Mongo is quite popular if you use something very rare I have some doubts that they will support it so quickly So A couple of words about the differences between single request vs chat First of all, first thing that you have to remember about single request that every request is brand new Like you cannot trust memory of AI. So you AI consider it like a fresh request. So it means that if you Formulated it Not in a good way You cannot just add some details You will need to rewrite it completely or just trying to create the query by yourself in a better way But at the same time the query structure is fixed because you remember it works like Completing the beginning like we started from start query from the word select It means that if someone will try to generate for example delete statement It will be impossible because the list statement cannot start from the word select So it won't work It's less flexible but more safe if you want to give it to some unexperienced users for example at the same time Chat mod is more flexible. You can really talk with AI You can say that hey, you can try this. Please add that and so on and it will just collect all this data and trying to apply it Summarize everything to the final statement and the The benefit of the chat mod is also that you can create any kind of Queries insert statements delete statements anything that you need It's not so safe So be sure that you Gave users the right permissions on the database level to avoid any kind of issues And I can try to show you really quickly How it works. I mean the differences between single request and chat If it won't work, I'm sorry in advance, but at least I'll try So we have four squares We have Some database and we will create this in you Editor and try to Create the first statement So oops no connection Try to reconnect No, I believe we have a connection So yes So let's look at the settings of AI first You can see that we have some API talking and I use GPT 3.5 16k for that and I have a checkbox to execute SQL immediately Just to not spend too much time on that So and Here I will ask To show all films Rented on May 25 As I had in my presentation So we'll wait for that and yes, you know, we'll have some comment It's it's my request show all files rented on May Then we have some Comment from AI like we need to retrieve all Films that were rented on May 25 to do this blah blah blah It tries to explain you everything that he did and how he created this query And as you can see we have this issue with the description that doesn't exist So Because of that we said, okay show Description Stupid way to avoid this error Nice Wait for that Ta-da, we have a result It worked But You know what we don't have Actual rental date we cannot check is it May 25 or not Why I don't know let's try to fix it Okay, we'll go here and say like I don't know include Rental date Ta-da I'm not sure that it's what we expected because now we have like Just two columns not Like all of them that we wanted to have So we can try to play further with that and try to formulate the right Description to get the right result Okay, let's try to do this in the chart So Like 2005 it didn't work Because the date Is Incorrect Yes So How can we fix it? We're trying to add some additional clues for him We can say that Release So What I say I I could add something like Release here 2005 No, it doesn't work. 
Okay, so let's try to reformulate the whole sentence. Let's start with something like "show all films." Not a lot of rows; we can count how many we have here, just a thousand, not a lot. Yes, it did it, everything is correct. And now we say, okay, only include the films rented on May 25. No, it doesn't work. And this is the thing: when we try to add refinements, we need to understand what kind of clues to add to get the right result. Sometimes it works nicely, sometimes it goes like this. I'm not trying to defend the AI model, but sometimes you get what you want very easily; sometimes chat mode is more convenient, sometimes less. It depends.

Let's return to the talk, and a couple more things I wanted to add. Where else can AI help? Honestly, not a lot of areas, but they exist. It can give you a general overview of an existing database schema, especially in chat mode: you can say something like "could you explain this schema to me," and it will try to tell you what tables are there and how they relate to each other. If you're not a very experienced user, or you just need the extra information, that can be useful. The same goes for a particular object: it can tell you which other tables your table is connected to, or where a view gets its data from. For some people that's genuinely handy.

The next point is optimizing queries. It's the dream, really: send your query that runs with an error, or runs slowly, to some magic box, and the magic box returns something beautiful. AI won't help you much there, but sometimes it can give you good advice about what's going on. And the last point is generating test data through INSERT statements. It's not so much "please generate me some data with INSERT statements"; it's more like "could you add this to my table," or even "could you add 20 rows to my table?"
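For a sense of what that looks like, the generated test data typically comes back as one INSERT per row, something like the following; table, columns, and values are purely illustrative:

```sql
INSERT INTO customer (first_name, last_name, email)
VALUES ('Anna', 'Smith', 'anna.smith@example.com');

INSERT INTO customer (first_name, last_name, email)
VALUES ('Boris', 'Lee', 'boris.lee@example.com');
```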
Yes. But again, it depends on the database and on the complexity of your database, so sometimes it works and sometimes it doesn't. The worst part is the speed of the AI response that you just saw: if you need to generate hundreds of rows, you first have to wait for the script to be generated, then run it, and it inserts rows one by one, because it's a single INSERT per row. For some test-data cases that's fine, for some it isn't, but you can try it.

Maybe the last important point here is security; we need to talk about that. In DBeaver, specifically for this, we show a message: in order to perform completion, DBeaver will send your metadata to the API. Is that safe? It's a fair question. If you're learning SQL and you have a Pagila database, every model knows the Pagila database better than anyone, and there's nothing dangerous in sending that kind of metadata. Moreover, if we look at the OpenAI privacy policy, they say they never send your data anywhere and don't use it for training their models. For Azure OpenAI the privacy policy is even stronger, and they even provide a private cloud instance where your model runs and no one else has access to that space. That's why a lot of enterprises prefer Azure OpenAI over plain OpenAI: they trust Microsoft more. Google Gemini, at the same time, honestly tells you: please do not submit sensitive, confidential, or personal information, because your data can be used to improve Google's technologies, enterprise features, products, and services. Maybe that's your payment for the free model, the one million tokens, remember; no other model gives you that. So yes, do not send sensitive, confidential, or personal data. On the other hand, if we look at the problem from the other side, we don't send any actual data; the data itself stays private. The only thing we send is the structure of the database. So it's your choice: do you trust them or not, and how valuable is your database schema, do you want to share it with someone or not.

What else do we have? I've listed a few other models here that we currently don't support in DBeaver: GitHub Copilot and AWS CodeWhisperer. The reason we don't support them is that they don't have a public API. They're integrated into VS Code and, I believe, IntelliJ IDEA, but if you want to build your own product or use them independently, it's impossible. I asked the GitHub team at an AWS conference why they don't have a public API, and the answer was something like, "we don't trust it for now, we want to take some time." Again, you can believe that answer or not, but my point is that, at least for now, they have no public API.

The interesting thing about Llama 2: it's an open-source model from Meta, which means you can't just use it the way you use Google Gemini.
You can take it, you can train it, and after that you can use it. Currently there are who knows how many new startups using Llama 2 as the basis for their models and tools, so maybe the first, and for now the most popular, open language model is Llama 2. And then Claude 3, which maybe fewer people here are familiar with. I put it on the slide because just a couple of days ago I read an article saying that, compared with the OpenAI models, it gives much better results for SQL queries. Because, currently, there is unfortunately no really good model for SQL: they write other programming languages, like Java or maybe C#, more or less okay, but not SQL. They say Claude 3 does this work for SQL, so who knows.

The last question is: how good is AI? Can AI replace all of us in the near future? Can AI create all the queries for us, so that we just relax and leave it to the robots? I took an article, a joint paper by American and Chinese researchers, where they also tried to build such a model, and the most important thing is the last row of their table, because they compare the models with human performance. As you can see, the difference for now is huge, and there's a lot of work ahead for the language-model developers to get closer to human performance.

So you can treat these models as your assistant, as someone who helps your perhaps less experienced colleague, because these models are very good for users who are not familiar with SQL but still need to work with data. After a few attempts, after some learning, they can create very complicated queries with a lot of joins: you can combine three, five, seven tables, add complicated conditions, even write self-joins with AI. You can do a lot, and all of it without SQL knowledge. And it's very good for well-known databases like Postgres. It's also a good place to start if you're learning SQL, and not necessarily as a student: maybe you switched databases, maybe you were always an Oracle developer and now you work with Mongo. Do you know how to write queries there tomorrow? Maybe, maybe not, and that's where AI can help you.

And here is the end; I'm ready for your questions if you have them. Thank you very much. No questions? Thank you, everyone. You were very good listeners, I really appreciate it.