had tons of data and we didn't know where to put that data. Eventually we cooked up something like a database. It was horrible, it was awful, but it allowed us to analyze what we got from our expedition materials. That was my first introduction to data; it was 1991. So, JSON: it is JavaScript Object Notation. If you are not familiar with it, it is a data interchange format. It appeared somewhere in the early 2000s and essentially solves one problem: communication between client and server, where the client doesn't need to be a complicated engine and the format should be understandable and easily parsable. It is a collection of key-value pairs, and when we say value, a value can be another JSON object, an array, or an array of JSON objects, and it can nest to different levels: you have one key, the value is another JSON object, then you have another key inside that object, like a Russian doll. And sometimes, in my experience, that nesting can be quite intensive, like 20 or 30 levels of nesting in some cases. I'm not going to spend too much time on why we use JSON: because it is flexible, because developers like it, it is widely adopted for logging, it is used in microservices, in remote procedure calls, in REST APIs, and it is suitable for loosely defined data. There are tons of reasons to use JSON. There are some reasons not to use JSON, but I'm not going to talk about that; it is a highly opinionated discussion, probably.

So, I believe JSON was introduced as a first-class citizen in Postgres probably in 2012. If I'm not right, just correct me. And it was kind of a first implementation; I don't remember who said at that time that Postgres cheated when implementing JSON, because JSON was implemented as a text format, it was not a binary implementation at first. But later it was properly implemented as JSONB, and JSONB is a binary implementation of JSON. And what is the difference? As I said, JSON itself is a blob of text that keeps all the original formatting, keys and everything else exactly as it was inserted. JSONB is a binary tree, with a header, with the JEntry and the content stored separately, and you have a search tree inside each JSONB value. Since it has to build that structure, it is parsed every time you insert the data. From that fundamental difference between the two formats come some additional consequences.

Data storage: again, JSON preserves the original format, JSONB transforms it to binary. When JSONB transforms to binary, it eliminates duplicated keys and stores only the last value for a duplicated key, while JSON stores the duplicated keys exactly as they were inserted. Operations with keys: JSON itself, as a format, is a text value; if you want to update it, you just replace it. JSONB, on the other hand, can be updated: you can add a key, delete a key, update a key, and the same applies to nested keys as well. And performance-wise, JSON is really good when you insert the data, because you don't need to parse it, but it is not so great when you want to query the data, because when you search for a key value you need to parse every JSON value, get the keys and do all that work. So when you work with JSONB, you pay when you insert the data, but you get the benefits when you extract the data. Also, JSONB supports GIN indexes, which can improve your search, in some cases with a really great benefit, in some cases not; it depends.
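As an aside, here is a minimal sketch of those key-level operations on JSONB (the values and paths are just illustrative, not taken from the slides):

    -- update a nested key in place
    SELECT jsonb_set('{"user": {"active": false}}'::jsonb, '{user,active}', 'true');
    -- delete a key
    SELECT '{"a": 1, "b": 2}'::jsonb - 'b';
    -- add / merge keys
    SELECT '{"a": 1}'::jsonb || '{"c": 3}'::jsonb;
    -- with plain json there is no in-place equivalent: you replace the whole value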
Again, it depends on how the GIN tree is built and what kind of data you have inside, but it is still one of the benefits you can consider when you choose how to store your data, JSON or JSONB. The simple example: I created a table with two columns, one is an id and the other is json, and when I created the table I inserted a JSON value just to show how different the output is. I put in two rows, one cast as json, the same data cast as jsonb, and you can see the difference. The first row has the duplicated keys and is exactly as it was in the original insert, but in the second the keys were sorted first, and the "active" key has only the last value, active equal true, so it is different in many ways. It is just a simple example of how they differ when you insert the data and what kind of output you may get with json versus jsonb.

So what to choose? That's probably the million-dollar question, because if you know in advance what you want to do with the data you can probably make a better decision. JSONB is preferable in most cases: when you want to update the data, when you want to index the data without knowing the access paths in advance, and when you care about extraction performance. But there are some cases where JSON can be much more preferable. For example, when you have to keep the original key order, and sometimes that is important, for example if you store access logs and all the information related to the activity of your application for some government or regulatory requirement where they want to see exactly how it was inserted. When you need the duplicated keys; I know some cases where you want to keep all the values for duplicated keys. When insert speed is extremely important, when maybe extraction is not as important as being able to insert that much data. And when you need the original format for some other reason; yesterday, I believe, Chris, talking about null values, mentioned one case where you can potentially have null values in JSON, but if you try to insert the same data as JSONB it gets converted to an empty string instead of null. So that's a potential case. So, which one to choose is a good question, and the first question to ask is: what are you going to do with the data inside the table?

A simple example of how much you pay for JSONB parsing: it is just an arbitrary table with about 26 million rows, and I insert that data into two different tables. The first case is json, and you can see that I spent about 185 seconds to insert all those rows; for jsonb I spent more than two times more. And if I create an index on top of the json, a B-tree expression index with some simple path, a very small one, the insert speed drops by two times; but when you create the same index on jsonb you can see the difference is not so big, because when you parse the data you use the same parsed data to insert into the index. For a select, from json you get 67 seconds, from jsonb you get only 4 seconds; that is a 16-times performance benefit, right? If you work with indexes the difference is a little bit smaller, because what differs is only extracting the path once the values are already filtered by the index. For another select I just add one more condition, and the difference between jsonb and json without indexes is even bigger. So you can see that extraction is much faster for JSONB even without any indexes, and with indexes that difference still exists.
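Roughly reconstructed, the json-versus-jsonb insert example from above looks something like this (table and key names are my own, not necessarily the ones on the slide):

    CREATE TABLE demo (id int, doc json);
    -- the same text, inserted once as json and once passed through jsonb first
    INSERT INTO demo VALUES
      (1, '{"b": 1, "a": 2, "active": false, "active": true}'::json),
      (2, ('{"b": 1, "a": 2, "active": false, "active": true}'::jsonb)::json);
    SELECT * FROM demo;
    -- id 1 keeps the duplicated keys and the original formatting;
    -- id 2 comes back with sorted keys and only the last value, "active": true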
Yeah, question? Good question; the question was, can we index json, not jsonb? We are going to talk about that in detail later, but yes, you can create some indexes. Yes, it is the same table, same conditions; jsonb is just slightly bigger when you insert the data. Yes, the size, for jsonb, because you have a header, you have additional metadata on top of the JSON when you parse it; but it is not always the case. Yes, but if you don't have extra spaces, like in that case, and if you don't have duplicated keys, you don't save too much, right? You don't save too much space. So yes, potentially JSONB can be smaller, but if you don't have extra spaces, duplicated keys or anything else, it will take a little bit more... space, sorry.

All right, a couple of words about how we work with JSON. Let's step back and think about how we work with data in the database. We first plan and build an execution path, then we extract the data from the pages in the buffer cache, put it into memory, and work with the data: we work with data sets, we join data sets, we aggregate, and everything else. That's great. And we don't talk too much about how we extract, for example, column number N from the page, right? We have a column, we find the place where it starts, we read the data, we put it into memory; that's great. What about the case of JSON? We find the column, we read the column, but that is not the end of the work, as it is for, say, text or a number. Now you have to extract the data from inside that column. So you have one extra operation, and your performance will depend on how you do that.

For simple data access, like a single key path, there is minimal overhead. What you do is basically pay as you go: you go through the JSON, you get the value, you put the Datum here, you get your key, you get your value, that's it, we are done. And it works really well, because you don't do anything in advance, you do it straight away during the parsing, and it is good for a small JSON value with minimal nesting; when I say small JSON, I mean JSON which fits inside your page block without the TOAST table. That is really good, because it is the most efficient way to extract the data. But what happens if you have multiple levels of nesting? You get your column, then you start to parse your column, you find the key and you find the value, and the value is another JSON; okay, and it is already a Datum here, so you need to go and parse another level of JSON, and then you put it here, and then you parse another one and put it here. It is a chain of operations which, in the end, is not really efficient; if you have, for example, four or six levels of nesting, it is not an efficient way to do it anymore. So how can you improve it?
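For reference, the chain just described is what you get with the classic operator-based access (hypothetical table and key names):

    -- each -> step returns a new json/jsonb datum that has to be inspected again
    SELECT doc -> 'customer' -> 'address' ->> 'city'
    FROM orders
    WHERE id = 42;
    -- one step of this is cheap; four or six chained steps are not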
All right, for example, JSON subscripting. It is one possible way to do it a little differently: it builds the path to the value in advance and then accesses the JSON and gets the value. So instead of going level by level through the nesting, it goes directly; you pay in advance, but you save time on those operations. If you have one level, or one level below, it will not really give you any difference compared with the previous access method, but it will help when the nesting is more like six or seven levels; at the eighth level it will probably start to struggle as well.

One fun part, so I have a question for you: what do you think is faster, if you write data equals a jsonb literal like '{"first_name": "bar"}', or if I put jsonb_build_object('first_name', 'bar')? Which one would be faster, what do you think, guys? Any guess? One bet on the first one. That's right, the second one is faster; the second one works more efficiently with the Postgres kernel. That's probably beyond this presentation, guys; it is a great question how it is implemented, but it would take too much time to really go through all of the implementation. It is just one example of what you need to think about when you work with this data, when you think about analytics and what you do.

So that is subscripting, and as I said it is good for small sizes and moderate nesting. When you have big JSON documents you probably want to go to JSON path, and when you have really deep nesting, like 10 levels, it will probably be even more expensive than the previous approach in terms of how you build the access to your value along the key path in advance, but it works better with big JSON documents. And all of that is dwarfed once your JSON grows in size, so we need to talk about how it is stored, right? It can be stored inline or use TOAST, and TOAST is The Oversized-Attribute Storage Technique; it means another table with an index where chunks of your data are stored. By default, if your data is more than about two kilobytes, Postgres will try to compress the data, and if it cannot compress it to a sufficient size it will put that data into the TOAST table. And why is that so important? We are talking about access: when that happens, instead of the actual column in the page you have only a pointer into the TOAST index, and when you use the index you will use at least two I/O operations, the root block and a branch block, that is the very minimum, and if you have only one chunk to access that is going to be another I/O operation. So you have to use at least three I/O operations to get the data, and if you have more than one chunk, or if you have to access more than one branch, it will be even more. So in that case the cost of extracting the data from JSON depends primarily on the size of the JSON; it is a direct dependence, almost linear: the bigger the JSON, the more you spend to extract the data. So that's about TOAST.
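A minimal sketch of those two access styles, plus a quick way to check value sizes, assuming a hypothetical orders table with a jsonb column doc:

    -- jsonb subscripting (PostgreSQL 14+): the whole path is resolved at once
    SELECT doc['customer']['address']['city'] FROM orders WHERE id = 42;

    -- jsonpath, usually the better fit for big documents and deeper nesting
    SELECT jsonb_path_query(doc, '$.customer.address.city') FROM orders WHERE id = 42;

    -- rough check of on-disk value sizes, to see which rows are likely TOASTed
    -- (the default threshold is about 2 kB)
    SELECT id, pg_column_size(doc) AS bytes FROM orders ORDER BY 2 DESC LIMIT 10;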
Yeah, question? You can change it; technically it is a compile-time default, but technically you can change it. The block is designed to hold a minimum of four rows, right, and that's where the 2 kB comes from.

Let's talk about analytics, and why we are talking about analytics. The goal of analytics is to get some business value from the data, because without that the data doesn't have much value. Usually what we do is aggregate data, filter it and present it in some readable format, probably to business owners, and what we want is to work with up-to-date data, and we sometimes want to combine it with operational data to get the full picture, which is quite important. And speaking about performance, we want to get it fast, but we don't want to impact our OLTP transactions on the operational database. So we want a lot of things; it doesn't always work this way.

You have probably heard the phrase "data is the new oil", right? In one sense I support that phrase; it has been a buzzword for a long, long time, since 2011 or so. Yes, data is the new oil, because oil is useless as a liquid: you can't do anything with oil if you buy a bottle of it. You cannot put it in the car, you cannot use it as a lubricant, you can't do anything, it is just a stinky liquid that doesn't feel very good. But if you work on it and refine it properly, you can get gas, plastic and everything else out of oil. The same goes for data: raw data doesn't provide any value for analytics by itself, you have to work, to extract the value from the data, and you do that by slicing and dicing the data using window functions, ranking, filtering and everything else you have for analytics.

And when we talk about the difference between OLTP and analytics, as a general, high-level, bird's-eye view: it is single-row versus multi-row access; it is OLTP DML touching the entire row, like you insert a row or update a row, versus a very limited number of columns for analytics, when you slice and dice your data; query execution time is milliseconds versus minutes, and sometimes hours, sometimes days, by the way; and OLTP is thousands of similar queries hitting your database, while with analytics sometimes people think about a query for a couple of hours and then run that query for another couple of hours just to get the data. So it is a different way of thinking about it.

Why would you do analytics in Postgres? There are dedicated engines for analytics, like Redshift at AWS, BigQuery at Google, and so on. First, to get the data into a warehouse you have to move the data, and data movement, especially in organizations that care about privacy and security, is not a trivial thing. When you move data from an operational database you have to think about security and privacy, how you mask the data, how you work with the data, how you limit access to the data; there are tons of things once you start to move your data. It is not like you create a pipeline as a proof of concept in the cloud in two hours and the task is solved; no, there come the security assessors and so on: no, no, you can't do that, this has to stay, or has to be masked, or has to be removed from the data set, and you start to build serious pipelines with security and encryption. Second, Postgres is a relational database which is the backbone of the operational data set, and you want your analytics to be as up to date as possible. And SQL in Postgres is quite rich in analytical functions, and I know very, very smart people doing wonders with Postgres functions and analytics inside the database.
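A minimal sketch of that slice-and-dice style on top of a JSON column, assuming a hypothetical sales table with a jsonb column named data:

    SELECT data->>'county'                                               AS county,
           sum((data->>'bottles_sold')::int)                             AS bottles,
           rank() OVER (ORDER BY sum((data->>'bottles_sold')::int) DESC) AS rnk
    FROM sales
    GROUP BY 1;
    -- every row has to be parsed to evaluate data->>'...', which is exactly
    -- where the JSON access cost shows up in analytical queries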
This is really kind of an art, how you can do analytics with SQL, and those people don't really want to learn a new language. Why would they, if you have a really good SQL language? Sometimes, for example, BigQuery has an SQL language which is not really SQL, it is a special dialect of SQL which you need to learn, and it has its own limitations and so on. So how are we doing?

All right, some challenges with JSON. Data inconsistency, that's quite important: you can see that the insert date is 2024, March 14, yesterday; it is kind of obvious, right? But if you have a different locale on your server it could insert the date as 03, 14, 2024, which is different, and your application may not be able to transform that data. And it can be even worse: 2024, 03, 12; is it March 12 or is it the 3rd of December? You never know, really, because it can be both. Data denormalization: it means you have a lot of duplicated data in each row of your JSON, which increases its size, and we discussed that size is important, TOAST can kill your performance; potential performance implications because of size; potential integrity problems, because you don't have integrity constraints on top of the JSON; and, by the way, by default you don't have any statistics for the values inside JSON, so when the planner deals with the JSON it doesn't know what is inside.

Some recommendations; I don't call them best practices, but they come from experience. Don't keep everything in a JSON value, apart from the cases where you want to do that and you know why you do it; the idea of a one-column table is not the greatest one when you try to work with the data later, that's from my experience. Try to put only the flexible part of the data, the part you want to be in JSON, into the JSON column, and try to keep it small, to avoid TOAST. The dictionary data, the dimensions by which you want to slice your data for analytics, is probably better kept outside of JSON, because you want that data to be structured and in traditional formats; I will show you how that can help in some cases. And again, for the conditions by which you want to filter your data, or group your data, try to avoid JSON parsing in that part, because it can slow down your analytics query really, really significantly, I would say; we are talking about big data sets where every microsecond matters.

What can help in general? Of course you can have replicas, use partitioning, that helps a lot, parallelism, and in some cases, depending on the engine, you can probably use a columnar store; in some cases it helps a lot, in some cases it doesn't, but it is one of the possible solutions, and we will talk a little bit about columnar later. So let's talk about indexes. There was a question, can we index JSON values? Yes, we can. We can use a B-tree index as an expression index when you know in advance exactly which key you want to access through that index: I want to access that particular key, I know the path, I can put that path into the index and create a B-tree, or a hash index as a matter of fact, and use it in your condition to filter the data. That works, and works really well, but it has its own limitations; the same goes for nested keys and everything else, it's the same expression index. If you don't know in advance how you want to slice the data, if you don't know the path in the condition, then there is the GIN index, and the GIN index works only with JSONB. The expression index is quite simple; in the example here, I want to check how many bottles were sold (the data, by the way, is from Iowa, about alcoholic beverage sales). If I want to see how many bottles were sold in Wayne County, Iowa, I can put the county key of the sales data into my index expression and create a B-tree index.
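A minimal sketch of that expression index, with hypothetical names modeled on the Iowa liquor sales data set:

    CREATE INDEX sales_county_btree ON sales ((data->>'county'));

    SELECT sum((data->>'bottles_sold')::int)
    FROM sales
    WHERE data->>'county' = 'WAYNE';
    -- the index is only usable by conditions on this exact expression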
It works really well: it goes through a bitmap index scan, then a heap scan to extract the data from the JSON, and then the aggregate, and here we go: the execution time is 64 milliseconds for that particular query. And I put it here because we are going to compare it with other types of indexes. I don't want to spend too much time going through the plan; the plan is simple, you have a bitmap index scan on the index, then a parallel heap scan on the data itself to extract the value, and then the aggregation, the sum of bottles.

The GIN index: it has nothing to do with the beverage, it is a Generalized Inverted Index. It has a different structure, a structure with nodes and branches, and you have the IDs of the rows for your values inside each branch; it doesn't keep any duplicates, and it splits pages when they overflow. And when you update the data it is quite interesting: when you update the data it creates a pending list for the updates, and it works in combination with that pending list to search your data, and then, depending on the parameters of your instance, it will push the updates into the GIN index, and that push has its own consequences. I have seen, with a highly updated table, that sometimes you get a huge spike in activity when your database is 100% busy applying those changes from the pending list, and then it goes back to normal. By default it uses the jsonb_ops default operator class, which means it supports the operators for the jsonb data type, and the syntax is quite simple: you just create the index on your jsonb column using GIN. Then you run the query, and behind the scenes you essentially see mostly the same thing, a bitmap index scan and then a heap scan and so on, but it is quite interesting that the execution time is slower for the GIN index in this case, and there are some reasons for that. First, the GIN index is bigger in size, much bigger in size; we will see the difference in size later, I have a summary table with the results. And it is not an apples-to-apples comparison: the previous index works only in one case, for the county search, it doesn't work for anything else really, but the GIN index will work for all key-value searches. Although, when I say it works for all key values, that is when the planner decides to use the index, because that is not always the case; just keep in mind that creating an index doesn't mean it will be used every time, it will depend on the cost. The GIN index has its own cost for the planner, and the planner will see whether it makes sense to use the index. When I said we don't have statistics for JSON columns: we don't have statistics for the column, but we do have statistics for the index, and, for example, in this case I have Wayne County, and if I put Polk County instead — if somebody knows what Polk County is in Iowa, it is the biggest one, right?
It is the biggest county, yes; so in that case most of the sales in Iowa happen in Polk County, and in that case the planner will say: oh no, I'm not going to use the index, it doesn't make sense, I will do a full sequential scan instead. So what else can we do to minimize the size of the GIN index? We can use a GIN index with jsonb_path_ops, which supports only a limited number of operators, the JSON path ones, but it is more effective because it is smaller and the access has a lower cost compared with the default GIN index. So if we have a look: the GIN index execution time was 177 milliseconds; for the same exact operation, if we put jsonb_path_ops when creating the index and execute exactly the same query, the execution time is 64 milliseconds, which means it is comparable with the B-tree index, but it is not limited, like the B-tree index, to only one path, you can use different paths as well. So in some cases it is probably a better solution than just a GIN index with all operators; it depends on how you want to work with your data, really; you need to know which queries you are going to execute.

Indexes and paths: nothing comes for free in this world, right, and of course if you have an index you have overhead, and the overhead can be quite significant. It is an insert, update and delete penalty, it is storage, it is vacuuming: for every operation, if you update or delete in the table, you also need to work with the index. And we were talking about pending changes for GIN indexes; sometimes there is a huge spike in the load, coming out of the blue, just because it decided okay, it is time to apply all the pending changes to the index. A simple example of the index penalty: we inserted the data in nine minutes without an index; create a simple GIN index and insert the same data into the same table, and it takes thirty-four minutes. That is the difference, that is the index overhead, and I'm not even talking about the size; it is just a simple, very efficient insert.

Some summary table; it is not really data you can reuse, but it is for comparison, it is more or less neck and neck: the same data, the same table, different data types, different indexes. The build time for the B-tree index is 34 seconds, quick, but again it is small, one path per index. For the GIN index it is quite different: you have 1,181 seconds to build the default GIN index with the jsonb_ops class; for jsonb_path_ops it is more than two times less. And you can see the sizes as well: the B-tree index is 174 megabytes, and by the way the table itself is about 20 gigabytes, 26 million rows, so 174 megabytes for one small B-tree index; but for GIN it is 4 gigabytes, which is already comparable with the table size, in the same ballpark, and for GIN with jsonb_path_ops it is slightly less, but still about 3 gigabytes. So there is storage overhead you need to keep in mind, and storage overhead is not only the storage itself, it is the vacuuming operations and everything else related to that. Then I ran just two simple queries, relatively the same as we used in the previous examples, and measured the average access time in milliseconds (I ran it probably twenty times or so): with the B-tree it is 30 milliseconds, again, we are talking milliseconds, a fraction of a second; with the GIN index it is 178 milliseconds; with jsonb_path_ops it is 66.
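For reference, a minimal sketch of the two GIN variants being compared, with the same hypothetical names as before:

    -- default operator class (jsonb_ops): indexes all keys and values
    CREATE INDEX sales_gin ON sales USING gin (data);

    -- jsonb_path_ops: smaller and cheaper, but supports only the
    -- containment / path operators (@>, @?, @@)
    CREATE INDEX sales_gin_path ON sales USING gin (data jsonb_path_ops);

    -- a query both variants can serve, written as a containment test
    SELECT sum((data->>'bottles_sold')::int)
    FROM sales
    WHERE data @> '{"county": "WAYNE"}';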
Again, by the way, that is for JSONB; for JSON it is slightly different if you build a B-tree index on it, and if you don't have any indexes at all it is four seconds for exactly the same query. The second query adds one more condition; it extracts a little bit more data, so in that case your select will be less efficient, but that is just the cost of extracting the data from the JSON itself: for jsonb_path_ops it is 375 and then 139. Again, the difference for JSONB itself will not be so big, because it is a different path: for JSONB without indexes you are going through all the blocks anyway, and depending on the conditions, whether you extract data for two conditions or one, there will not be too much difference. And the inserts, here, you can probably tell: it is 520 seconds, 2,358 seconds, and about a thousand seconds; if we talk about the plain JSONB insert it is not too much different from JSONB with the index, right, and it makes sense, because you parse your data anyway and you just add the data to the index at the same time, so there is not too much difference.

We are almost there. We have an index advisor in AlloyDB at Google, for example; as of now it can help with non-JSON values, but we are working on it, and I'm trying to push it, because I like the JSON data type, to make it work with JSON and make it more suitable. So what can you do? You can increase CPU and memory allocation, throw tons of hardware at it, or you can try to use a columnar engine. I know I'm almost out of time, but let me steal a couple of minutes of your time: we have a columnar engine, and that columnar engine is a representation of your data in memory using column units, and each column unit has a dictionary and information about what is stored inside the column unit. It can also push down predicates, it uses Bloom filters, and it uses SIMD to access the data; SIMD is single instruction, multiple data, which can be quite effective when you work with data in memory: it is like parallel access, where you touch 64 data points in memory in one CPU cycle, which increases performance a lot. It can help with aggregations, joins, fast pruning for partitioned tables and everything else. So essentially you have a dictionary on top of that column unit and you know exactly what is inside it. It works for most of the data types, except JSON; I know, you think, oh, it is a great engine, but no, it doesn't work with JSON. For the JSON data type the inside is like a blob, and the columnar engine doesn't really know, as of now, what is inside the JSON. But it helps anyway: for example, if you run the same query with a GIN index versus the columnar engine, with your data in the columnar store, it will perform relatively the same, but you will not have that index overhead when you insert the data, which can be quite significant. And it helps the most when both the non-JSON and the JSON column data are populated into the columnar engine, meaning when you have your dimensions outside of JSON and your conditions, your predicates, don't need JSON parsing; in that case the columnar engine can provide a really huge boost in performance. For example, here is the execution plan, an example of how different they are: I have two queries, and I have a GIN index and a filter on the JSON column, which means parsing; I believe it is here that I am filtering on
that condition. It is 95 seconds execution time when I run that query. When I remove the index and do it with columnar only, with the data in the columnar store, it takes about 76 seconds, not too much difference, but you don't have the overhead when you insert the data, which can be quite significant, or when you update the data. But what if you put that county value outside of the JSON and use it in the selection? Where is it... here, yeah: when that happens it is 341 milliseconds, which is quite a significant improvement in performance, and what did I do? I just put one value outside of the JSON for analytics and ran relatively the same query. That is why I think a table with one JSON column and nothing else is not the greatest idea. We have run out of time, I think; questions?

So you can use... let me explain how I understood the question: can I include the JSON path in the GIN index? Yeah, here, yeah, you can; so you mean... yes, you can do that, and it means you want to have an index-only scan instead. Technically yes, you can do that, but remember that it will not be for free, right, it is not going to be for free; I mean, your overhead will be bigger, yeah, and it depends on your data size and how much you are going to pay for that, the rate of changes. Yes, it is one of the possibilities. Which one? Yeah, but it is limited in what you can do with that option: you have only one path, one predefined path for that index, nothing else; that is what you can actually do. How many operators does it support? jsonb_path_ops supports only the three listed operators, nothing else; that is how you filter the data, and that is the main difference, that is why it is smaller, why it is faster, why it works that way; nothing is for free in this world, you pay one way or another.

For me, the columnar part, that is a good question, and I'm still working with our internal team, because I want it, and a lot of customers want it, because it creates a huge opportunity when you don't need to create indexes in your analytical environment: you insert tons of data and at the same time get really good performance. Yes, it is AlloyDB, and AlloyDB Omni, which you can download: the engine is available in AlloyDB at Google Cloud, or in AlloyDB Omni, which you can download and use in your own environment; it does the same, because AlloyDB Omni is shipped with the same kernel as AlloyDB. Not at the storage layer, the storage layer is different, but columnar is in memory; yes, the columnar store is an in-memory representation of your existing data, and it is sorted during the parsing, because it parses the data when you insert it, it is sorted during the parsing in the same way. The most interesting thing for me, if we go back... oh, sorry... no, you can't... where is it... so for me the most interesting thing is that you would expect the "active" key to be here, but it is mostly over here, and then "active", because of how the keys are ordered you have to look for where it is; that is why it is at the end. For JSON, yes; and essentially it can be even worse if you are using the TOAST table. Good question, I don't know off the top of my head; maybe somebody... how can you get the size, you want to get the size before insert or after? Okay, I think I can't say off the top of my head, but yes, I can see the size of the column, yes.
It is how Postgres works with any data type, right: it tries to compress the data if it can, and if the compressed data fits within that limit, the two kilobytes, it will be fine. I don't remember off the top of my head, sorry guys, good question, I need to check. No, only the rows which cannot fit. Question? So, it really depends on the size of your data set, because to use columnar you need to be able to put it into memory, into the column store, and it is compressed in memory, it is quite efficient, but if you have petabytes of data to analyze, you will not be able to put it all into memory, right. Not the whole table, but columns of the tables, yes. The population of the columns can be done in three different ways: you can put the names of the columns and tables you want to populate into your Postgres configuration, and they are populated every time the database starts, that is one way; the second, you can use a SQL function to populate it at any time you want; or you can try to rely on the automatic engine, which will dynamically populate or evict columns and tables from memory based on your workload. It runs every hour by default, analyzes your workload queries and tries to do the population automatically behind the scenes. Yeah, behind... in the cloud, yes, it sits on top of Colossus; you have a special engine, because AlloyDB doesn't really write to data files, no, it writes to the WAL, and the WAL changes are applied to the storage system; you don't have the same checkpoint process in AlloyDB as you have in Postgres, you read the data from the buffer cache and from the Colossus storage, yes. Yeah, it is applied behind the scenes; it is not Colossus itself, it is some services on top of Colossus, yes, the file system Colossus has... yeah, but at the same time you have a special layer on top of Colossus exactly for AlloyDB, but yes, it is using Colossus behind the scenes. No, AlloyDB Omni works the traditional way, with just normal storage. For storage optimization, yes. You can write to me, really, and I will tell the product manager that we have more votes to make that change; I'm trying to push it forward as much as I can, but it is a big company, and it depends on many things, on engineering resources, a lot of things. All right, any more questions, guys? I think I need to get a blessing first, I'm sorry folks, it is a big company; but yes, that is what I'm going to do, I'm going to apply for release of the slides and I will try to make them public; sometimes they do that. Yeah, but sometimes, for advocates like me, they let me publish it externally if it has already been presented. All right guys, any more questions? Do I have a slide with my email? Let me check... no, I don't think so, I probably forgot to put it in, my god. What happened... I know what is going on here, let me try... I just don't see how... what is it, print? Oh my god, how can I print it... that would be great, right, the very first slide... yeah, here, I don't have my email here, but my email address is last name at google dot com, last name at google dot com. Yes, I need to get a blessing from my management team first, but yes, I'm going to
publish it somewhere, and it is going to be publicly available, probably, if I can. Yeah.

Hello, hello, all right, hello everybody, can you hear me? Let's get started. Today I will talk about database schema migrations for lazybones, from chaos to heaven. It's a talk that tells a story, and I will be a little bit short on time, so please keep your questions for the end of the talk, because maybe the answers come later in the talk. A little word about me: I'm Julien Riou, I work for OVHcloud as an open-source DBA; you can already find my slides on my website, and you can follow me on Mastodon if you want. Today we're going to talk about the basics of schema migrations, what we do at OVHcloud, and especially the context around the situation, which back in the day was more or less chaos; we have implemented schemas as code, CI/CD, code reviews, and we will talk about automation and inventory, to get to the heaven part, with a little opening to the future at the end.

So, the basics. We need to talk about the schema: a schema is a set of relations, so you can have multiple tables with foreign keys, tables have columns, and this represents the schema. As the business evolves you also have schema evolutions: you have to create new tables, new columns, new constraints and new everything, all the time, and a schema migration is the process of going from one state to another. For the example here we have two tables, one with customers, and we want to add a column to the customer table, and for that we have DDL, the data definition language, which is basically ALTER statements, CREATE, DROP, and those statements require a higher level of privileges; they can take locks and can have a really high impact on the service, so you should be aware, you should take care when you run them.

And who are we? We are OVHcloud, a cloud provider in Europe; we offer various cloud services, and they have one thing in common: at some point they use Postgres databases, but not only Postgres, also MySQL. The databases here are mission critical: if one of the databases is down, the service is impacted for the customers and we lose money, and money is really important for the business. I'm not talking about the public database offerings, we also have offerings of various databases; I'm talking about the internal databases that run every product, and I'm part of the team managing those databases. A little bit of statistics: we have seven autonomous infrastructures around the world, mainly in Europe, but also in Canada and the US; this represents hundreds of clusters spread around, with five hundred servers and more, and two thousand databases. As I said, we have MySQL and Postgres, and a little bit of MongoDB, but that is off topic, and the most important point: we manage the databases in highly secure environments, to mention, we have PCI DSS restrictions. So databases are really important for the business, and we have implemented clusters of servers with two nodes in synchronous replication and a third node in asynchronous replication which is not accessible by the application: we use it for taking ZFS snapshots, for backups and recovery procedures. And we have a bunch of load balancers in front of the nodes
just for high availability, and we use a virtual IP. This is a cluster example; we have hundreds of them. At the beginning, when I joined the company eight years ago, it was more like a start-up, at least in the mindset: all the developers and ops were in the same room, and when some developer wanted a feature they just came to my desk, or the team's desk, to say hey, I want to do this, can you help me? And of course, even at that time we didn't allow them to run DDL statements themselves, because of previous outages. So we were in the same open space, they came to my desk: can you do this? Okay, we SSH in, run psql against the database, we try to understand what the developer wanted us to do, we run the query, and I'm done. What could go wrong? First, where is the database? Okay, I know, because by repetition we know where the databases are by heart... more or less. Sometimes we misunderstood what the developer wanted us to do, so we run the query and oops, it wasn't that. Or: I've run the query here, let's deploy the application, oh, you forgot one of them. And okay, let's do this in a transaction, or without a transaction, because by default there is no transaction, and one query in the middle of multiple queries fails, so let's use a transaction; okay, but at the end you need to commit, or roll back if you want, but I forgot to commit, left the session open, and oops: you grab a coffee and... oh, I got paged, what's going on? And sometimes a query we think will be very short takes a long time and locks everything down, and in the end even the permission system could be broken, because we forgot to set some special rules to apply the privileges on some new relations, and the new relations aren't accessible. And at the end the developers want to blame the DBAs: you messed up, sure.

So what can we improve? At the very beginning we wanted to track the changes using a ticketing system; at that time we had OTRS, to create, okay, the hello message and eventually some SQL statement, but we also had Jira at the same time doing basically the same, so we had multiple ticketing systems for multiple teams. SQL statements are sometimes badly formatted, it doesn't prevent bad copy-paste, and when you select text from those nice what-you-see-is-what-you-get editors you can mess up your SQL statement; and it's a poor reviewing system. So we moved on, and one thing developers like is Git repositories, so we decided to put every schema in a Git repository, one per database at first, and for some databases we use one repository for multiple databases, because some databases share the same schema: they can live on different hosts, or be multiple databases that are somewhat sharded and share the same table structure. We have five hundred and thirty-four repositories today. And to apply those migrations we decided to use sql-migrate, which is a schema migration library for Go, but there is a CLI that we can use, and it's open source, you can use it if you want. From the developer's point of view, the only thing they need to do is to create what we call migrations; those are SQL files
with two steps: one for moving forward, which is the up migration, here to create the table, and the down migration, to roll back eventually. So, achievement unlocked: we speak the same language, which is SQL, and no more translation errors. We use one file for one migration, and the tool automatically creates a transaction under the hood, so there is no need to specify BEGIN and COMMIT, it's built in; but sometimes we need to disable it for technical reasons, for example Postgres doesn't allow running CREATE INDEX CONCURRENTLY inside a transaction, so we can disable it using a hint. The contract with our developers is that we focus on DDL, and DDL only, not DML: what's inside the databases in terms of data is managed by the application, and you should handle it yourself; we focus only on the DDL, which is the most critical part, because if you lose data by running DDL, we have to restore it from the backups. And now we have atomicity.

From the DBA's point of view, we have the tool, which is sql-migrate at first, but we created a wrapper to implement some missing features that apply only to us, which is handling the SET ROLE statement and a little bit of orchestration around a migration, because we have multiple paths that we can use: the first one is the admin path, to declare extensions, which is run by a superuser, and the second one is the common one; a lot of databases only have the normal path, but we have to handle both, and the normal path is the one that creates objects using the right privileges. It goes like this: we give it a configuration file, and we have multiple operations we can do, a status, up or down, and we see all the migrations that have been applied, that is, the file name and when it was applied, so if we run it multiple times it skips the already applied migrations. That data is in the database; it's not in the file, the file contains only the DDL statements; it is a special table inside the database, yes, recording which migrations have been applied. So now we have good permissions, and for code reviews the developers used at the time, and still use now, Bitbucket, which is the tool chosen by the company; we wanted to have one single point for everyone, so we use it. It's a Git-based collaboration tool, and unfortunately it's not open source, but it has a lot of features we can use here, like comments and blocking tasks: for example, some migrations are not finished, they are work in progress, and they can block the merge operation by creating blocking tasks; and we can review, of course, by approving, or by saying that the migration needs work. What's interesting is that we can enforce rules to require multiple approvals: of course we want at least one member of the DBA team to approve, so we are blocking the process, but on purpose; and sometimes the databases are modified by developers that are not the owners of the database, so for that we require an approval from the database owner. So we are safe, but it slows the process down a bit, for security. And there is a UI where we can comment right inside the code, it's pretty much like GitHub or anything else, and we can click OK and merge.
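For illustration, a migration file could look roughly like this, assuming I remember the sql-migrate directives correctly (file names and columns are hypothetical):

    -- 0002_add_last_login.sql
    -- +migrate Up
    ALTER TABLE customers ADD COLUMN last_login timestamptz;

    -- +migrate Down
    ALTER TABLE customers DROP COLUMN last_login;

    -- 0003_index_last_login.sql
    -- the notransaction hint disables the implicit transaction, which is needed
    -- for statements like CREATE INDEX CONCURRENTLY
    -- +migrate Up notransaction
    CREATE INDEX CONCURRENTLY customers_last_login_idx ON customers (last_login);

    -- +migrate Down notransaction
    DROP INDEX CONCURRENTLY customers_last_login_idx;

Applied migrations are recorded in a small bookkeeping table inside the database itself (gorp_migrations by default, if I recall correctly), which is how already-applied files get skipped on subsequent runs.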
There is a naming convention that we try to enforce internally, but you can name your files as you want; they will be sorted alphabetically, but you can name them as you want. So now we can provide expertise, but this doesn't prevent bad syntax: if we run the queries, the statements go directly to the database; even if you have a tool, it will run whatever is inside the file, and if the file is bad it will fail. And we don't want to reach production with failing code that has been merged, so for that we use CI/CD. A little bit about the terminology: CI is continuous integration, which is meant to test locally, ensure the software, or the migrations, are deliverable, and merge changes into the main branch; continuous delivery is to build a release from that branch, and continuous deployment is to build a release and deploy automatically. For us, we only use the continuous integration and local testing part, because we don't trust the software we use to reach the databases directly and deploy; we want to keep control of the deployment.

So first we have implemented tests without any data, and for that we use CDS; it's a CI/CD platform that we have built, and it's open source, you can check it out; don't worry, the slides are already available, so you can click through my presentation and follow if you want, and all those links are clickable. The way we have implemented it: we have one project for every schema code base; there is an application layer that we use to link Git repositories, so one application is one repository, and we have multiple pipelines to test on various DBMSs and various versions, and a workflow to orchestrate multiple pipelines. And there are webhooks, so whenever a developer pushes a branch, a webhook is sent to CDS to test the change. The workflow goes like this: we have the UI with all those tests, and there is a webhook that triggers the first pipeline and then the second one; if we click on a pipeline we can see that we have multiple stages. The first one is stage one, with an offline check: we check the syntax, syntactically or something, right inside the files, and only inside the files. Then we have a second stage where we run online tests: CDS is able to create Docker containers running Postgres and run the migrations for us, and the status shows whether it's okay or it fails. We also use ERAlchemy to create a nice diagram of the schema, and at the end we publish the schema on an internal website, so that developers can use that diagram directly inside their documentation; so at the end they have a visualization of the schema, and also the tests. And at the end of the process, in Bitbucket, either the build is a success or it fails, and if it fails it blocks the merge operation; there are also error messages that help, so that relieves us of some work: developers are autonomous in their reviews if they provide a bad migration. This happens whenever they push a branch, so when they push they can already access the workflow, and later they can create a pull request when everything is okay, so it's early. So now we ensure that the migrations are valid,
but it's a test without data. Sometimes we run on production and it fails because of the data, because databases store data and the schema can depend on the data. For tests with data, we use our backup system: we move the ZFS snapshots first onto what we call the filers, our big servers with a lot of spinning disks, to archive the ZFS snapshots over SSH, and we use the same technique on what we call a dev server, in a dev environment, where we start a new instance; we apply some anonymization procedures, or it's not accessible at all, and some apps can reach the dev environment with near-production data. So we can use this dev environment to perform a test of the schema migration: there is no impact on the business, and we get a pretty good estimation of how long the migration should take on production. So now we have tests with and without data; when it can be merged, we need to deploy it onto the production infrastructure. We already covered the code review part: developers create pull requests, DBAs review and merge, and then we still go onto the infrastructure with SSH, git pull, and run sql-migrate. And for safety reasons we use lock_timeout, because some statements can be blocked on a lock even if they weren't blocked on the dev environment, because of the busyness of the database, the concurrency; so we set this lock timeout to two seconds, which is okay for us, and we can retry later, at a less busy time, and eventually it will be okay. It's a safety measure, because sometimes, even if you have taken all the steps, you can be blocked by this. So downtime is now under control.

One of the remaining pain points is that we still need to connect to the infrastructure, but where? So I will talk about the inventory part. The first solution we found was: okay, we are DBAs, so we know databases, so why not create a database that tracks all the databases? So we created a database with four tables: one to register all the databases, linked to a schema repository table; databases are part of a cluster, and all the clusters have hosts with their roles. This is simple, and on top of that we provided a REST API, so with an HTTP call we can easily find and search for databases. But every change has to be declared, and it's very easy to miss an event, because it was populated by us, humans, and we can fail: we have to track when we add a cluster, a host, a database, a Git repository, or remove the same; it's very easy to miss an event, and it was 100% homemade. So we abandoned this project in favor of Consul. We now use service discovery: Consul is a distributed and highly available data store working with local agents, and it's also open source. It works like this: there is a bunch of servers, the Consul servers, and an agent on each server we have, and the agent talks to the Consul servers and says, hello, my name is Node1, I'm part of cluster 1 and I have DB1 and DB2, all the time, constantly, and if something changes, the change will be detected. There are multiple concepts: we have nodes, which register the name, the IP address and some metadata of the host, and services, the databases here; and it's secure, because there are access control lists managed by tokens, and it uses encryption for the transport. We have static configuration for the parts that never move, like the server type: for example, a host can be a PostgreSQL server or a MySQL server or a filer or anything else; this role will never change, it's the server type; and the
role can be a node, a load balancer or a backup, and this will never change; likewise the cluster, to identify the host, and the number of the host. And there is dynamic configuration for the moving parts: for example the sub-role, because a node can at any time be either a primary or a replica, because of automatic failovers, so we have to discover the sub-role and register it in a dynamic way; and of course the databases, because databases can be moved from one cluster to another, so we use a dynamic mechanism there. Consul also has a UI, and also a REST API we can use programmatically, but for the presentation I will use the UI. We can search for services: where is my database? For example, here, the test database: we search for "test", and "test" is registered on three hosts, the backup one, the first node, which is a replica, and the second one, which is the primary. So now it's easy to find a database and we won't miss any event.

So, which databases are behind a repository? This kind of relationship is harder to discover on the hosts themselves, so we decided to use the Git repository and a special file in there: when the merge event happens, we see this relationship in the payload. We have a "databases" key, we also have other keys, we have the region, here the US, and a list of databases; it can be test, it can be anything; the relationship is declared here, so now we can find our databases.

Okay, how can we improve that? By using automation, because now that we have the building blocks we want it to be smooth. For that we use Ansible. Ansible is that simple, really: if you want to automate, and you don't know where to start, and you have a classic infrastructure, the old way, not Kubernetes or something, you can seriously consider Ansible. It works with an inventory registering hosts and groups of hosts: hosts are the servers Ansible wants to connect to, and groups are a way to group multiple hosts sharing the same attributes. There is the notion of playbooks, which are groups of tasks running on groups, so on multiple hosts, and there are modules, which are small units of code to execute on hosts, and you write playbooks in YAML; okay, once again it's YAML, YAML is everywhere, but don't be afraid, this is simple YAML.

For schema migrations I would like to dive into the playbook and give you a real-world example. We have one block here to check arguments, running on "all", because we want to fail fast: we don't want to connect to the hosts themselves and discover late that there is an error, we want to check the variables first, to fail fast. Then we can update the schema migration, and we can target some specific hosts here, like by database name: we have groups with the name of the database, and we can combine that with the primary sub-role using the ampersand sign, so here we run the migration on the primary host where the database is registered. There are multiple tasks: one to create directories and configuration files, git clone the schema, and run the migrations. Here is an example of a task calling a module, the built-in "file" module of Ansible, to create a list of directories; the state here is "directory", and we loop over /etc/sql-migrate and /var/lib/sql-migrate to create the required directories. We could use another tool to do that, but here it's easier for us to do it inside the playbook, on demand, because we create one configuration file for the one database we want to run the migration on, and the database is here at that time, but can move, so we
do it on demand. The templates use Jinja, so here we already have Jinja variables, and the configuration file is built the same way. Then we use the git module to ensure the branch and the schema are up to date, and at the end we run a command to run the wrapper with all the required information. In reality it's not a plain command, we have created a module internally to record the execution times, but it's easier here to show you that you can run commands. The wrapper is not open source, because it's only a small thing, modified to handle the two migration paths, to avoid using two different commands; this was okay at that time. Right now, if we were to redo it, we would not recreate a wrapper, we would use sql-migrate and use the automation to do the orchestration, but when the wrapper was created there was no automation. And at the end, when you have written your playbook, you can call the ansible-playbook command with some extra vars to define the arguments and the playbook, and it runs. And Ansible can be plugged into Consul: we can use Consul to create our inventory, using the dynamic inventory feature, and it's open source, you can use it on your side.

Now, Ansible needs to connect to the hosts, and until that time the only way to connect to the hosts was to go through what we call The Bastion; it's an open source project we created to manage access via SSH, it's like an SSH gateway. It was accessed by humans, but we can use it here too; the main feature is that all sessions are recorded, so everything you do over SSH is recorded somewhere for audit purposes, and we can't connect directly to a server using SSH, Ansible has to use this bastion to be compliant. So we decided to build a wrapper (we like wrappers, but this time it's open source) to plug Ansible into the bastion, and we hook the SSH executable with the wrapper, and the same for SFTP, to copy some files, because Ansible sometimes needs to copy files to execute them remotely, and you should use SFTP because SCP is deprecated. You don't have to use the wrapper if you don't have the same restrictions; it's easier not to use the wrapper in that case.

But if we would like to take this a little step further: Ansible is run as a CLI tool, by humans, without traceability, or with hard traceability, because you have to go through the bastion, which is managed by another team, so you don't really know who and when, it's hard, and it's not always available, you need a CLI somewhere. For that we use AWX, because AWX is able to provide us with a REST API and a web interface to manage and execute Ansible playbooks. There are a lot of features, and the main ones are the use of personal accounts — when you are in a company you would like to have personal accounts and to use something that already exists — and notifications; I like alerting, we use it for alerting because some playbooks are critical if they are not applied, so if they fail there is an integration with a chat, we use WebEx, but it's just a webhook; and we use scheduled jobs to perform daily tasks; and of course the basics, inventory management, source control management to git pull the code base of the playbooks, and machine credentials to know how to connect to the servers, like the SSH keys. The web interface goes like this: you land on this page, you go to the templates, because templates are able to create jobs, templates are like the playbooks, and you search for the one you want, here to run the schema
You click run, you provide the arguments in the survey, you click launch, and boom, your job is scheduled, and here it's a success. Here is a screenshot of our chat system with the notifications, and at the end you can go back through the web UI to see the logs if you want the details. It's better for auditing, but it's also a way to provide an API for plugging other systems in an automated fashion. The CI/CD doesn't talk to AWX; we only use the CI/CD to test things locally, and we will see a little later where AWX is plugged in. So far I've shown how humans use this solution; as a human you can also use a CLI if you don't like UIs, and it's pretty similar to the ansible-playbook command, with the same information right in the terminal.

But now, when we click merge, we would like to know that a merge has happened and automatically create jobs. For that, Bitbucket has a hook feature that can send messages to a Kafka queue, so we consume those messages, parse the payload, and we created another little piece of software we call SchemaConsumer that consumes events from the Kafka queue to create jobs; it filters and parses events so the jobs get the right arguments. So we have the code reviews, the tests, the merge event that goes to Kafka, and then in every environment we have an instance of SchemaConsumer talking to AWX, getting information from the inventory, going through the bastion and executing the playbook on the infrastructure, everywhere. When we click merge, this is our chat: for example here it runs on a Canadian infrastructure, on some test infrastructure, and on the European and Canadian infrastructures all at the same time. So we can now apply everywhere.

And what if something fails? Because failures happen. We have the notifications in the chat, but if we are off or it's the weekend we won't see the chat; there are on-call alerts if needed, and even then we can miss a notification, for example if the notification system itself is down. So we also integrated a check based on NRPE, the Nagios Remote Plugin Executor or something like this, which we already use for monitoring the hosts, so it already existed at the time. With it, every host can detect its configured databases and detect if there is a drift from the schema, so we have the alert here, and the same if a run fails afterwards, so we are alerted. So, in the end (sorry for the reference, but I'm a huge fan of Linkin Park, and they are from here in Los Angeles), we are in great shape: we have everything we didn't have in the past, and it's like heaven. Perfect, right?
Yeah, there are some issues. There are problems with highly concurrent repositories: we have multiple features developed at the same time, not conflicting with each other, and multiple files created by various developers inside the same repository. This is an example of one repository over one month: there are seven pull requests, and one pull request can contain multiple files. The first solution is to use sequential file names: you take the last one, you name it plus one, and it should be okay. Okay, let's see what happens: we both end up with a v02 here, and the first one to be merged wins, so there is a conflict, and a conflict means the developer needs to rebase, and developers hate to rebase. So what if we use predicted file names, where we predict in advance which files will be merged? Here we have v02 and v03; someone works on v02, the v03 gets merged first, and later we merge the v02, so in reality the order applied is v01, v03, then v02. And what happens if we need to roll back the v02? The tool we use will roll back the latest version, not the latest applied, so we will roll back the wrong migration. So it's a problem. A solution for that is to use a declarative model: instead of going from one state to another, you declare the objects you want and the tool applies that target state instead. There are multiple solutions, like Atlas, which is compatible with MySQL and Postgres, but we haven't tested them, and Skeema, but that one is MySQL only, and there are probably a lot of others out there; if you know a tool that solves these problems, please come tell me afterwards, I would like some feedback.

We also have a problem with granular deployments: for example, we need to merge but not deploy everywhere, because we would like to try it on a test environment first and see if the feature we're building is okay, instead of deploying everywhere at the same time. A solution would be to deploy here, and if it's okay, then deploy everywhere, uncritical clusters first and then the critical ones, with no rollback. But this is kind of hard to do right. Do we have to implement a Git workflow, where the main branch is something like a development branch and from time to time we build a release and target the release branch to deploy everywhere? Or is there another solution? We are still trying to figure out how to implement this, so it will come, but it's not there yet.

And testing on real data is a problem, because we have a dev environment here, but it's not available everywhere; it's only in Europe, and not for every cluster, so we can't use it everywhere, and some applications can modify the data, so we can have conflicts on modifications that don't exist in production, and anonymization is hard. So we would like to trash this environment, and the solution is, instead of using the dev server, to use the backup server and create instances based on the latest ZFS snapshot directly on the backup host, because that host does nothing but the replication part, so we can use those resources for this. Then we have 100% of the production data, with a little lag, but it's better than what we have today.

The last part, because I'm running a little late, is SchemaConsumer. SchemaConsumer is 100% homemade; it's not open source, it's written in Python, it's okay, but we have to maintain it: add missing features, patch security vulnerabilities and so on. And there is a solution: Event-Driven Ansible, which matches exactly the feature we implemented.
It works like Ansible playbooks, but instead of a human or some other system creating a job, it watches an event source. Here we can use it to consume Kafka events, but you can also consume a web API or various other event sources, and it uses the same syntax as Ansible, so it's easy to move to and easy to use. It can be used to create jobs on AWX; you can also run playbooks directly, but we will use it to create jobs on AWX, where everything is already defined. And it's easy to contribute: in fact I've contributed two patches, to adapt the Kafka authentication and to add basic auth to the AWX controller, so it's really easy to update and it's open, which is way better than a private solution. Don't reinvent the wheel.

So the last word will be for the developers out there: SQL is your friend, please don't be afraid of SQL, it will make your DBAs happy. And for the DBAs out there: automation is easy, Ansible is very easy, it's not hard to automate things you do multiple times a day, and you should consider this kind of solution. Of course this project hasn't been done in one week or one month; it's multiple years of iterations, so if you are starting now, you can be inspired by the result, but you shouldn't deploy a complex infrastructure like ours right away; you should start small and iterate. And of course we made a lot of errors, and those errors shaped the final solution, so it's okay to make errors as long as you learn from them. I would like to thank you for being here; it's a great pleasure to be accepted at this conference, so I would like to thank the organizers, and that's it. Questions?

Yeah, so the question is how we detect drift, or differences, between the database and the schema. In fact we don't detect what's inside the database, only which migrations have or haven't been applied, only that. This is a major drawback, because there are probably differences: the tooling was deployed after the databases already existed, so there are probably differences. And for that the declarative model can be a solution: if you have a tool that manages the structure, in theory it should be able to detect the differences between the actual structure and the desired one and apply the difference.

So the question is whether there is a mechanism in Django migrations that solves the concurrency issue, and whether I have any feedback on it. The answer is no: we don't have any experience on the DBA side with Django; there are some developers using Django, but I don't have an answer for that, sorry.

The question is whether we have any plans to integrate application migrations with database migrations at the same time. It's a major feature being asked for by developers; today we don't have a solution for it, it's on our minds, and when we have a solution we will share it, of course. There is a big problem with it: a lot of applications are written in different languages, and there is no single way to integrate this migration with all those heterogeneous languages. The only way we can do it is to provide automation and combine it with other automations, so they can orchestrate multiple things together. Okay, any other questions?
Alright, thank you everybody, and if you have any questions I will be around. Thank you.

Test, test. Okay, I hate the sound of my own voice and I hate having to use a mic. I guess I'll give it another minute even though we're right at time; I know it's lunch time, so sorry, you're going to have to bear with me while I talk to you for 55 minutes about Presto. Nice to meet you. Well, barely, for me; I never even got into that career, I just did the degree and decided no, not that. Oh, that's really interesting, I want to hear more about that after this, assuming you don't have lunch plans right away. Nice to meet you.

Alright, I guess I'll go ahead and get started. Welcome everyone, and like I said, thanks for bearing with me during the lunch hour. Welcome to Leveraging PrestoDB for Data Success, and full disclosure, that's the last time you're going to hear me say PrestoDB instead of Presto. A little background on that: the Presto open source project was forked a few years back into two separate projects; one named itself PrestoSQL, the one I work on went by PrestoDB, and then PrestoSQL changed its name to Trino, so now Presto is free to just be Presto and not PrestoDB. The "DB" is a bit of a misnomer too, but we'll get into that.

Here's our agenda for the day. I'm thinking we'll go over basic 101 information on Presto: what is it, why would you use it, what makes it cool. The second half is one step deeper, so if you already know a lot about Presto you might be a little disappointed and not learn anything today, but I hope that's not the case; I'll do my best to give you a good medium-deep dive. My name is Kirsten, I'm an open source software developer with the Watson Open Technologies group at IBM, and I've only been working on Presto for a few months now. Before that I worked on a handful of other projects like PyTorch and a few projects in the Jupyter ecosystem, so essentially it's my job to go into open source projects, learn a little bit, and then go out and talk about them; it's sort of the imposter-syndrome job. I wish I had Yehung here to back me up on that; he's unfortunately at KubeCon Europe in Paris, but he's also been with me on Presto for the last few months and works on Kubeflow and Node.js and things like that. Without any further ado, if my slide would advance, we can get started.

In order to properly frame Presto, I like to start with an overview of what the data lakehouse is, because the one-liner for Presto is that it's a technology for the open data lakehouse, and since the lakehouse is a somewhat new concept I like to give a little background on what a lakehouse is. To do that we have to go one step further back and talk about two other founding technologies of the lakehouse. A data warehouse, I'm sure everyone's familiar, is sort of the first data-store concept: it stores relational data in your nice and organized rows and columns, and the data has usually already been transformed by the time it hits the database, so it's basically ready for analysis as-is. Unfortunately, that also means it's not super flexible, it's difficult to scale, and
not all data fits nicely in those rows and columns, right? So, sort of as an answer to the warehouse, along comes the data lake. A data lake also stores data, naturally, but it's a lot more flexible in the types of data it can store: not just relational but semi-structured data, unstructured data, any kind of data you want. Now, data in a lake is stored as-is; it usually hasn't been transformed by the time it hits the lake, it's just sort of thrown in there. But transformation has to happen at some point, right? That's kind of the whole point of analysis. In the lake concept you worry about that later, in the processing stage. That makes it very flexible (just put whatever data you want in the lake and access it later) and very easy to scale, but you're going to suffer from a bit of a lack of quality control, because the data isn't organized all nice and neatly like it is in a warehouse. So, people being greedy and wanting the best of both worlds, along comes the concept of a lakehouse, which is the portmanteau of warehouse and lake, just a combination of those words. The intent is to bring the advantages of the data warehouse to the flexibility and scalability of a data lake. The idea is that you can control data a little more effectively, like you can usually do with a warehouse as compared to a lake, but you still get to use object storage and data lake storage, which could be HDFS or any number of object storage options. You can also query data with SQL, as you would normally do with a warehouse, but it also enables you to do some more data-lake-like advanced analytics, like ML and AI capabilities.

So how is this done? This part can be a little controversial; there's no real agreed-upon definition of what a lakehouse is yet, at least not a hundred percent, but you're usually going to see these three things. You want a really performant processing layer, because you're dealing with a huge amount of data; we're working at lake size here, which is often a huge volume. You also want the flexibility to query a lot of different types of underlying storage, so to do that you're going to have to separate the compute and the storage portions of your infrastructure to provide that flexibility. And lastly (I won't touch a ton on this particular bullet point in this short talk) you're usually also going to want something called a table format, which is an additional layer of abstraction that sits in between the storage and the compute and helps you organize your data. It also typically provides some more warehouse-like features, which would be ACID transactions, indexing, caching, a historical view of your data over time, things that are typically associated with data governance, which a warehouse is usually superior at. I'll mention the table format a few more times, but that's the high-level overview here.
An open data lakehouse, then, is just a lakehouse built on open standards and open source technology. Alright, that finally brings us to Presto. At a high level, Presto is an open source SQL query engine for the open data lakehouse, and it's fast, reliable and efficient at scale. The query engine part is of course that performant processing layer we talked about: it accepts requests and executes them in order to return results from whatever that underlying storage is. It also provides the separation of storage and compute we were talking about, which we can see in the architecture diagram. At the very top we have a visualization / BI / reporting layer, and this is an optional layer; you certainly don't have to use a visualization tool, you can also submit requests to the Presto cluster via a CLI, a UI, and various programmatic tools. The cluster itself is in the center, and in this diagram our Presto cluster only has a single coordinator and three workers, which is pretty small and not a great real-world example, but Presto scales really well, so you can just as easily scale up to hundreds of nodes, including a multiple-coordinator setup. The lower layer represents the possible connections to the underlying data, and I say connections because Presto doesn't actually supply any storage of its own, hence why "PrestoDB" is a bit of a misnomer; it's not really a database in the traditional sense of the word. Rather, it uses this concept of connectors to connect to that underlying storage, and that could be your more traditional warehouse-like storage, like MySQL, PostgreSQL, things like that, but there are also connectors, of course, to your more traditional data lake storage. To connect to lake storage, which again is usually object storage or HDFS, you'll usually go through the table-format-specific connectors that Presto makes available, so those act sort of like a proxy to the underlying storage but also provide the additional features that come with adopting a table format in general.

So clearly we have disaggregated the compute and the storage here; we've got those first two pieces, and we'll get a little more into that in the 201 portion of the talk. Here are some common use cases. Presto is pretty flexible and it's used for a variety of different things. Dashboarding is a big one: Presto makes it possible to do real-time analytics and visualize basically the freshest data, and there's no need to ETL your data from one dedicated storage system into a dedicated dashboarding database. Also business intelligence; it integrates with a lot of BI tools. You can do report generation, especially for cross-functional analysis, and we'll get into the concept of federated querying in the next few slides. You can also do interactive exploration; this is a pretty popular one, because since Presto is really
good at the SQL processing side of things, interactive queries are probably your most common use case here: you can explore data interactively at pretty high speed. You can do batch processing, which we'll get slightly into; it's not my area of expertise personally, but Presto does support it. It has an integration with Spark called Presto on Spark, and it's sort of exactly what it sounds like, Presto's integration with Apache Spark, so you can submit ANSI SQL queries to Spark for batch processing. Presto also supports data-driven apps; again, due to how fast it is and its high-performance, real-time retrieval capabilities, it can support data-driven apps pretty well, Uber being one of them.

Speaking of Uber, here's a little sampling of some of the commonly highlighted companies that use Presto. These are definitely going to be your bigger ones, particularly Meta, who have one of the largest data lakehouses in the world and use Presto in several different flavors; they have a few different clusters tuned for different use cases. But yeah: Uber, ByteDance, Adobe. I'm not going to read out all the numbers, but with the nature of open source they're also all really good at contributing back to Presto, and we've certainly benefited from that. We'll get to more of that; I'm teasing "more about that later" on every slide, but it's true.

Here's a handful of deployment options, and the companies on the previous slide often use some combination of all of these. Running Presto as part of a Hadoop cluster is one option, whether that's an open source or a commercial deployment, like with Cloudera, or as part of Elastic MapReduce, so your managed Hadoop cluster. You can of course DIY, that is, run Presto and manage it yourself on a handful of virtual machines or bare-metal instances. Amazon has a serverless Presto service called Athena, which you may have heard about; that's exactly what it sounds like, serverless Presto. And finally you can of course deploy and manage Presto on Kubernetes. As I teased, Presto is quite easy to scale even without Kubernetes, but if you're using Kubernetes that kicks it into an even higher gear, and it's just a great option since Presto is the processing layer: we don't have to worry so much about the data, but we probably do want to make the Presto cluster itself as fault tolerant as possible, so Kubernetes is a great option for that.

Okay, so here's a summary of everything I just said, and now we get into the 201 portion, where we'll see why all of this is the way it is and explore some of the features a little more in depth. To do so I've created this fake company called Company J; bear with me. Company J is trying to determine the best way to manage all their data needs. They do a few different things; they're a pretty small company, but they have a few different
HDFS clusters, and they use those for a few different types of analysis: they do some interactive analysis on this data with Hive, and they also do some batch processing on it with Spark. Not only that, but they run some reports on top of a proprietary database that they built from scratch, and they query that with SQL (well, it doesn't really matter what they query it with, they query it), and some of this reporting also requires joining data that exists in HDFS with the data in their proprietary database. There are a few issues with this setup that they're finding as they continue to grow. First, their data in general is growing in size as they grow, and Hive is proving to be a little too slow for the interactive analysis that they want to do. Also, Company J's analysts are hired for their knowledge of SQL, and not necessarily Spark SQL or HiveQL or any other number of query languages that are out there, and it's non-trivial for them to have to translate these queries if they're doing various types of analysis on different clients; they also have to debug them, and they waste a lot of time converting all of that. And lastly, they have to maintain those data copy pipelines, the ones that copy the data into HDFS; they have to do some transformation and all that good stuff before they can run a query on both of those data sources at once. Time is money, and they're finding that they're wasting a lot of it with their current setup. So now we get to follow them through their decision-making process as I weave in some more information about Presto.

The first thing I like to highlight: we're all at a Linux expo, we don't really need a lecture on why open source is awesome, but I'm going to give it to you anyway. Presto is of course open source. It was originally created in, I think, the 2012 timeframe at Meta, and once it had matured they pretty much open-sourced it on GitHub right away; I believe that was officially 2013, so Presto has really been open source from the beginning. Then in 2019 it was donated to the Linux Foundation, and specifically it sits under, and is managed by, the Presto Foundation. As we probably all know, Linux Foundation projects are awesome because they're controlled by a neutral and diverse governing body, so it's not like one or two companies have a monopoly over the roadmap of the project. That's one of the benefits: the roadmap is driven by the needs of the community as a whole, innovation tends to happen a little more organically this way, and innovations are free for everyone to both contribute and take advantage of; they're not reserved for monetization by certain companies. Also,
by design, an open source project is just going to be more flexible; it's able to integrate with more tools, since with a closed-governance project there's often going to be some vendor lock-in of certain features and things like that. Luckily, with open source the attitude is the more tools we can integrate with, the better, and we'll see some examples of that with Presto. And then of course there's the community support, both from those that have been working on and contributing back to the project since its inception (Meta, for example, in the case of Presto), but also support from companies that, if you're new to a project, are often likewise new to the project, so you get support from all levels, all perspectives, all stages of the pipeline.

Alright, Company J is looking at this list. This is again not really an exhaustive list, but a list of some household-name companies that are either adopters, contributors, or foundation members of Presto, and obviously these companies have a lot of data management experience. Company J likes the look of it, but their open source interest goes further than that. Recall that they're a small company: they don't necessarily have the time to build a proprietary solution from scratch. They are, however, good open source citizens, and they do have the time to devote a few developers a few cycles a month to contribute back to Presto, which is the right thing to do. And, if I may get a little touchy-feely here, Company J's values align with the values of the Presto Foundation as well. As I mentioned, I've worked on a few open source projects; I wouldn't say I'm an expert in Presto, I'm more an expert in open source, and from my perspective Presto actually has a really, objectively great community: super helpful, and very diverse in terms of who's going to be answering your questions or working on a particular PR with you. So they back themselves up on that one, and Company J loves that; they're all about diversity and welcoming and being supportive.

Alright, but they're not sold yet, so let's move on to see what else Company J is interested in about Presto. One thing that's almost mentioned as an afterthought sometimes, but is a little more important than you would initially think: Presto is ANSI SQL compliant. The main bonus of this, as I sort of alluded to earlier, is that anyone querying the data only has to know a single query language, so you don't have to memorize all the different dialects of SQL or whatever other query language you're using. What this translates to is Presto's concept of "SQL on anything": you can really simplify querying your relational data, and also your unstructured data; you can query MongoDB and other NoSQL data, and really anything in between. That also applies, in fact, to the Presto on Spark integration I mentioned earlier, so you can run ANSI SQL queries on Spark. Essentially how that works is that Presto's compiler and evaluator are integrated as a library with Spark, and then the job is sent off to Spark's RDD API to execute,
so basically you can do batch processing but still use your SQL and take advantage of Presto's SQL optimizations. All of this to say, Presto supports federated querying, the concept being that you can query across multiple databases and data sources with a single SQL statement, and we'll see a little more about this on the next slide. We've seen that the storage and compute are well and truly decoupled (I don't think I have to keep convincing you of that), but the way that happens is Presto's concept of connectors; that's really what does the "SQL on anything" part of Presto. Presto can connect to any data source that has an implementation of the Presto SPI, whether that's your traditional databases like MySQL or the data lake storage, and it's the implementation of this connector that determines what functionality of that native data source is available for Presto to take advantage of. And this isn't limited to open source solutions: it's perfectly possible to write a connector for a proprietary data source as well. Some of those have been contributed back to Presto, but plenty exist within a single company's ecosystem and are just used by them. I've covered that you can also connect to various clients as well, in addition to using the CLI and the UI and all that good stuff. Presto also supports user-defined functions and user-defined data types, so basically you can customize your ANSI SQL by simply supplying a plug-in, and when Presto starts up it will be fully available to use. That sort of highlights the flexibility and the large number of integration options that you're typically going to have with an open source project.

Specifically, to connect an underlying data source you're going to have a file that looks something like what's in this gray box (sketched again just below). This is a very, very simple example of a connection being made to a MySQL catalog, but essentially you just add one file per catalog into a specific directory where Presto expects to find these properties files when it starts up, and clearly you're supplying the information that a Presto node will need in order to connect to that data. In this example it's obviously connecting using the JDBC protocol on localhost with some authentication credentials, but yes, properties files can of course get much, much more complicated than this, and they also tend to include additional options regarding your preferences for how you want the data accessed and all those good things. I won't cover all of this terminology, but if you do end up spinning up a Presto cluster and giving it a spin, just be advised that in the data management world there's obviously different terminology for different things, and you might want to review that before you get started; I'll save the time in this case.
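The gray box itself is not reproduced in the transcript, but a minimal catalog file for the MySQL connector typically looks something like the following; the host, port and credentials here are placeholders, not values from the talk.

```properties
# etc/catalog/mysql.properties: one properties file per catalog (values are placeholders)
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=presto_user
connection-password=secret
```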
Alright, we're back to Company J. This is the "before"; it should look familiar. Again, we're having a few issues here, and we'll get a little more in depth. Analysts need to run commands on three separate clients, which is a lot of duplication and wasted time converting queries as needed, debugging them, and all that good stuff. They saw that Intuit, the company that owns TurboTax, uses Presto to save thousands of hours on query conversion per year, so they want to do the same. They also have to maintain those data copy pipelines, and that is pretty expensive, and not just resource-intensive in the sense that engineers have to maintain those pipelines and occasionally reimplement them in the case of API changes or system upgrades; it's also computationally expensive. It can be hours or days to copy that data into a new source before it's queryable, and while it's being copied it's also using a lot of network bandwidth. That's a lot of waste considering they only run this join between these two databases every so often. And again, Hive is slow; poor Hive, they get so much crap for being slow, and I'm sorry, I'm also piling on. So they need a little bit faster analytics, and they read that Meta has replaced Hive with instances of Presto and the Presto Hive connector, so they want to take advantage of that too.

So here's the "after", after they decide to go with Presto. Admittedly this is not the best representation of how Presto on Spark works, but it's the simplest I could get it for these purposes. Essentially, as far as Presto on Spark goes, Presto's query parser and planner are used to create an efficient query plan, enumerate data splits and things like that, and then it's sent off to Spark. With the other two data sources it's a little simpler: now we can use whichever Presto client we choose to query both the Hadoop lake data, using the Presto Hive connector, and the proprietary database that they have, for which they wrote a custom Presto connector. Most important, perhaps, is the lack of data copy pipelines that they need to maintain. Now they can do a federated query that hits both the HDFS data and their proprietary data at the same time; they can do the join with a single statement, with no need to maintain or reimplement any pipelines (something like the sketch below). We've circumvented the need to use Hive's processing engine, but we're still using the Hive metastore: Presto accesses the Hive metastore to do its query planning and then uses the Hive connector to connect to HDFS under the hood.
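As a rough illustration of what such a federated join could look like: the catalog, schema, table and column names below are hypothetical stand-ins (a "hive" catalog pointing at the HDFS data and a "companyjdb" catalog for the custom connector), not Company J's actual ones.

```sql
-- Hypothetical federated join across two catalogs in a single statement
SELECT o.region,
       sum(o.amount)                 AS total_sales,
       count(DISTINCT c.customer_id) AS customers
FROM hive.sales.orders AS o                 -- lake data via the Hive connector
JOIN companyjdb.public.customers AS c       -- proprietary database via the custom connector
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY o.region
ORDER BY total_sales DESC;
```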
The last thing I'll cover is a little bit more about why Presto is considered fast, and of course this is going to depend; it's a complicated topic, and there are always ways to tweak any particular system to make it work for your use case, but Presto's query analysis is really the key here. Presto, as we saw in the architecture diagram, is built on a distributed architecture, so it's on the massively parallel processing model, and this is what makes it possible to run hundreds of concurrent queries and scale up to thousands of worker nodes. In the bottom right of this diagram I've tried to indicate that Presto workers work on their tasks in parallel, and each uses whatever connector is relevant in order to access the underlying data source; for a federated query that data, and that connector, is going to be different, or can be different, though that is of course not always the case. Presto also does as much processing as possible in memory, so workers execute their tasks in memory, and this also includes an in-memory streaming shuffle that occurs to exchange data between nodes. Here the dark blue and the light blue are the stages and tasks respectively, and in between stages, rather than writing out the result of a stage to disk, it's just streamed between nodes, at least as much as possible.

Speaking of query stages, Presto has a very efficient stage and task management process that essentially starts when you submit your SQL string: an ANTLR parser on the coordinator parses it, converts it to an abstract syntax tree, validates it, and then finally breaks it into a logical relational tree, which is those dark blue and light blue stages and tasks. Part of this optimization also includes a lot of what would be considered query optimizations at the task level, so based on how the query is formed, Presto has several built-in rule-based optimizations and supports predicate pushdown and things like that; it can also do cost-based optimizations, depending on the data source you're accessing, where it can do join reordering and all that good stuff. There are also a handful of caching mechanisms built in, and again this is just going to depend on what connector you're using and whether that functionality is available for the connector you're interested in. There is a Hive metastore cache that can store certain metadata for future requests, like partition information and things like that. There is a list file cache, which, as you can imagine, stores file paths and certain attributes of those files to avoid repeated file-list operations in certain data lake storage systems. And there are third-party caching options, so again that extensibility of open source comes into play here: the company Alluxio has built and contributed back an SDK cache that works on the Presto workers, and they also have another cache, more of a service, that you can integrate with Presto. Finally, as I sort of hinted at, there are also a lot of ways you can tweak your resource management; you have resource groups and lots of different configuration options, so depending on what your setup is, there's an option for you.

And this is just a screenshot I wanted to call out because it's a cool feature, especially if you're new to some of the benefits Presto provides: on the UI you can access the live query plan, and essentially it breaks down whatever query you submit; you can see the stages it's going through and all the tasks that are part of it, and
it's interactive, so it updates in real time based on what's been completed and what hasn't. I probably could have chosen a better example here that actually showed a join and some branching and some tree action, but that's an exercise for the listener, so you can go check that out. The last thing I want to highlight in terms of processing power and fast processing is that Presto really has performance top of mind, above all else, so there's a lot of work currently being done on Prestissimo, also called Presto on Velox, which is an implementation of the Presto worker nodes in C++ rather than Java, which is the standard right now. The idea is that all node communication in Presto, between workers and other workers and between workers and coordinators, is done via REST endpoints, so there's potentially a lot of CPU time to be saved by avoiding starting up a JVM process on hundreds to thousands of worker nodes. Eventually Prestissimo will be the standard that comes with Presto; right now it's still Java, but there's a ton of movement. Contributors from Meta and Uber are working on this a lot, as well as IBM, and it's a hot topic in the community right now.

Alright, so this is the last time you'll see Company J. They're happy now, and they've integrated even more tools. Now that they have a unified view of their data, they can connect a visualization tool, and they go with Apache Zeppelin, another open source project. Also, since Presto has an Apache Iceberg connector, which has been very popular and getting a lot of contributions recently, they've decided to adopt Apache Iceberg as their table format, which really helps them manage and organize all their data a little better and provides those ACID transactions and governance capabilities, a historical view of their data, and all that good stuff. So Company J has their dream Barbie data lakehouse, and they've saved their employees a bunch of time in doing so.

One last thing I want to highlight: there are of course a few ways you can try Presto out if you're interested, but just in the last year we've had an official Helm chart. It's a little bit in its infancy. There are three deployment modes: a single-node mode, where a single node is both the worker and the coordinator; the cluster mode, which I believe is one coordinator and three workers; and a high-availability mode, which includes multiple coordinators and a lot more in case of failure scenarios. My ask to you is, if you're interested, please check it out, because we're always looking for people to break it, raise issues, raise questions so we can improve the documentation; feel free to open a PR if you'd like. It's new, but it's exciting and we definitely want to keep working on it. And the same goes for Presto overall: like I said, I think it's a great community, so if you haven't yet contributed to open source and you want to, Presto is a great place to start, in my humble opinion. Join the public Slack, you can ask questions there; same with me, feel free to email me if you'd like, or connect on LinkedIn,
but I will open it up for questions now. Full disclosure, I'm not the expert, so I will do my best, and if I can't answer something I will ask the Slack channel for you. Thank you.

Okay, no, not at all, that's a great question. The question was: does Presto support writes back to the databases? And "it depends on the connector" is essentially the answer. A lot of them, I would say most, do, so you can write to most sources. One example I can give: if you've heard of Apache Hudi, that's another table format, and right now that connector is in its adolescence, I would say; you can read, but you can't write back, but on the roadmap is, yes, we do want to support writes for this connector, it's just not possible at the moment. So it's going to depend, but in my experience (and there are thirty-plus connectors and I only really ever work on two of them, so I don't know if I'm a great resource for this) I would say most support writing back. Most, yes, exactly.

Okay, I'm not familiar, so the question was: can you compare and contrast with things like Oozie, Airflow, and tech like Dremel? I can't, really. I'm familiar with Airflow, but this is one of those situations where that's a big piece that I intend to understand more about Presto myself: the realistic use cases of why you would choose one over the other. Not only am I new to Presto, I'm sort of new to data management to begin with; most of my other work was on applications. So I wish I had a better answer; essentially I don't have a good answer for you. They're very, very similar, yeah.

Oh, sorry, yeah, the question was to compare Presto and Trino. So yeah, Trino is what was previously PrestoSQL. The overarching answer (I don't know if this is fully agreed upon, I don't know if Trino agrees with this), at a high level, and obviously they were forked fairly recently, I think it might have been 2020 or 2021 that they were forked from the same base repository, is that Presto has ever so slightly more of a focus on performance and Trino has ever so slightly more of a focus on flexibility. So Trino's support for connectors is further along, and Presto is further along with all the work being done on Prestissimo, and just top of mind, if you're submitting a PR to Presto, people will be questioning whether you've made sure that all of it is optimized, and things like that. Not to say that Trino doesn't care about that, but yeah, that's my impression from the last few months of what each one is good at; of course they are very similar, essentially at the heart of it almost exactly the same. Yeah.
The question is how that works, particularly with NoSQL databases, in scenarios where you want to submit a query and avoid, say, a full scan of the entire collection, and things like that. That's again a connector-specific problem that has to be solved, and when it comes down to it, the bottleneck will often be something like that: a constraint the underlying data source has on how it ultimately grabs data for you. So that is still a concern, and it's going to depend: some connectors have solved certain problems like that and do some of those optimizations, ensuring that if we can avoid doing this, we will, and slightly tweaking the plan so it avoids things like that. It's not a guarantee, though; it's going to depend on the implementation of that connector, because that's what's really responsible for translating between two very different styles of querying. The connectors are implemented with that in mind and will do as much of that optimization as possible, but you will still have a stop-gap there: it will get as close as possible, but it can't solve for every scenario. And full disclosure, I mostly work on the Iceberg connector, so NoSQL and your typical warehouse structures I'm not as familiar with. Yeah.

So the question is, if you're not working at lakehouse scale, are you still going to see the benefits of adopting Presto? I would say yes, but again it's going to depend (the cop-out answer). There are definitely companies that do that and have what would be considered relatively small databases. Now, if you have a small amount of data and it's only stored in one data source, then you're really going to have to do a lot of benchmarking to figure out whether Presto is really going to help you with it; it might, it might not. The more data sources you add, the more relevant Presto is going to get, and also the more size, up to a point; of course there are also other tools that are even better at much, much larger scale. So it depends on your setup. Is it lunch time? Well, thank you so much; please feel free to reach out to me, and if you ask a question on Slack I'll probably be the one answering, so you might see me again. Have a good one.

Test one two, test one two. This sounds about right. Can you hear me in the back as I talk a little bit? Welcome to SCaLE 21x. Test one two. Ooh, this is real nice, that sounds pretty good; I need to adjust the handheld a little bit more, but this sounds about the right level, because I'm talking comfortably and we have nice amplification. Is this one working? Can you hear me? Okay, we're testing out this microphone. Test one two, this one's working now too. Alright, so the microphones in the room sound good. Test one two, this is Michael Starch on the audio; we are just doing one last AV test while you come in, making sure our levels sound good. Does it sound good on the remote feed?
Okay, so it sounds like we've got all these rooms set. Good luck with your presentation, do great, and I'll hand it back to the room proctor to get this track started. Hello, room proctor, we are all set; we will start in five. I'm just testing the microphone. Can you take a couple of pictures? I'll just stand here. Okay, good to start.

Yeah, okay, hello everyone, thank you for coming in right after lunch; I hope you didn't have a very heavy lunch, because then you're just going to fall asleep over here. Okay, so the topic of our conversation today is Postgres performance for application developers. I probably need to re-emphasize that this talk is focused on application developers; if you are already a database expert or a Postgres expert, you will likely know all or most of the concepts that are going to be discussed over here. The key point, though, is that there is no magic button, which is why we need to understand these concepts. My name is Umair Shahid, and if you do a select star from me, this is what you get. I have flown in from Islamabad, Pakistan to give this talk and do a few other things in the region, and I have been working in the Postgres space for more than 20 years. I started off in 2003 and have spent time with the major Postgres providers: I started off at EDB, went on to OpenSCG and 2ndQuadrant, 2ndQuadrant got bought out by EDB, so right back to square one, and then onwards to Percona. Now I'm running a professional services organization focused on Postgres, by the name of Stormatics, with the mission to help businesses scale Postgres reliably for mission-critical data. But that is not what we're here to talk about.

The idea behind this talk is to enable and empower developers to leverage the Postgres database in a way that will help applications scale, in a way that when they bring their database to the DBAs, the DBAs won't smack their own faces, and to help ensure that there is longevity in the life of the applications, so that as users and data increase, the database can keep up. What we're going to cover during this talk is how the adoption of Postgres is expanding all over the world, how the cloud providers, or the move towards the cloud, are driving most of this adoption, the difference between kicking off an application and actually scaling it in production, and then four broad areas for scaling Postgres, those being query and SQL optimization, performance features of Postgres, some architectural improvements that you can make, and configuration parameter tuning that is easy and intuitive to do.

Now, I created this graph with the help of two popular and well-known publications. One is DB-Engines, which, as database experts and folks in Postgres, we're very keen on quoting, because Postgres tends to be the DBMS of the year pretty often, and it was the DBMS of the year in 2023 as well. What I've done is cross-reference the DB-Engines rankings with the developer survey that Stack Overflow does on an annual basis: I took the top five databases from DB-Engines and tracked their popularity over the course of five years based on Stack Overflow's data, and this is the graph that resulted. Postgres was pretty popular to begin with; if you go back five years, it wasn't number one, but it wasn't too bad either, it
was at number two, and MySQL had a pretty big lead over the other databases at number one. As the years progressed, we see that pretty much every other database is on a declining pattern, Postgres being the only database on an upward trajectory, in 2023 reaching the point of being the most popular database amongst application developers. Now, the key thing here is that the way DB-Engines tracks popularity and the way Stack Overflow tracks it are two different KPIs and two different measures: while DB-Engines is focused on database admins and database discussions in general, Stack Overflow is focused on application developers.

Let's combine this with a study that Gartner published at the beginning of 2023. I don't have the updated version that I'm sure Gartner must have published earlier this year as well; if somebody has access to it, please do share, but I'm fairly certain it's going to be very similar. There are a few key things to notice here. One is that databases showed 14.4 percent growth, which equals an 11 billion dollar gain for the year, and 98 percent of this growth actually happened in the cloud, so 98 percent of that 11 billion gain took place in the cloud, and 80 percent of that cloud is covered by AWS, Microsoft and Google. Interesting stats there, but what this shows is that Postgres is now extremely popular, or the most popular, amongst application developers, and most of this development, most of this deployment, is taking place in the cloud.

And then the question becomes: what do you do when you want to scale Postgres in the cloud? There's a simple answer to it: swipe your credit card, right? That's what everybody wants you to do. You swipe your credit card, you get a bigger machine, and there you go, you have scaled your Postgres instance. But what you actually end up doing by swiping your credit card is just delaying the inevitable. No matter how big a machine you get, you are eventually going to run out of resources if your database is not designed properly, if you haven't architected your solution for optimized execution. So that may not be the best solution, even though it might sound very intuitive.

Another aspect here: applications are developed and tested in sandboxes and other such environments, and this is kind of what the test environment looks like. You drive a race car over here; you can test it out, you can really test the cornering and the speeds and the bursts and the handling, et cetera. But production actually looks like this. This is an actual picture from my hometown, and believe you me, if you're stuck here, it doesn't matter how fast your car is, you're not going anywhere. So yes, your fast car probably performs really well on that track; it won't do anything here. In order to make sure that you're able to leverage the power that you have from your machines and from your database, there are many different aspects that need to be optimized, and unfortunately this button does not exist, or at least I haven't found it so far; if somebody has found it, or knows how to access it, please let me know. But usually, as we work with our customers and clients, this tends to be what people are looking for. For some odd reason, each time somebody comes to us with a performance problem, the question usually is: but there must be something that I can just switch on and everything is going to be fine, there must be some
tweak that I can do to make everything better. That's the unicorn-like button people tend to be searching for. It doesn't exist, I don't know where it is, and we cannot rely on a "make everything okay" switch. And hence we come to the four aspects I want to talk about during this conversation today: optimizing your queries and SQL, certain performance features that you can leverage within Postgres, the architectural improvements that you can make in your Postgres deployment, and some parameters that you can tune. Again, the important aspect here is that these are high-level, abstracted concepts that I'll be talking about today, easy to understand and easy to use; you don't need to be a guru in databases to be able to do this. Application developers are very smart people, and this is stuff that developers can easily handle; you don't need database expertise, and you don't need DBAs, to be able to do this, but it is important to take care of some of these things.

We'll start off with query and SQL optimization and go through a few basic tips and tricks that you can leverage. The first thing I'd like to share is pg_stat_statements. This is an extension that is included in the Postgres distribution, so if you're using Postgres, if you have deployed Postgres, even if you don't know about pg_stat_statements, let me assure you that it exists on your system, it exists with your database. What it provides is a view into statistics of the SQL statements being executed in the database. Typically, when we start analyzing a slow-performing database in any production or customer environment, there are two key factors we look at: one is which queries are taking too long to run, the longest-running queries, and the second is which queries are running very, very frequently. The reason behind the first, of course, is that if there are queries that are taking too long to execute, you want to make sure you optimize them, and if there are frequently run queries, you want to make sure they're as efficient as possible, because they are being executed very, very frequently. In pg_stat_statements you can look at the mean execution time for the longest-running queries, which gives you insight into which queries are taking a long time to execute, and at calls, to see which queries are being called most frequently. Two other interesting stats you can analyze in pg_stat_statements are standard deviation and I/O intensity. Standard deviation, because if a query takes very long to execute at one point in time and executes very quickly at a different point in time, there's something wacky going on that probably needs investigation; query performance should be fairly consistent, and if the standard deviation is big, something needs to be investigated. And I/O, well, it's an expensive operation: it takes a round trip to the disk, and that round trip is expensive, so if there are queries that are I/O intensive, you want to see if there is a way to optimize them such that they leverage the cache, they leverage the memory that is available, and minimize the trips to the disk. A query along the lines of the sketch below is a typical starting point.
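This is only a rough sketch of that starting point: it assumes the extension has already been installed, and it uses the PostgreSQL 13+ column names (older releases call these columns total_time, mean_time and stddev_time).

```sql
-- Prerequisites: shared_preload_libraries = 'pg_stat_statements' in postgresql.conf,
-- then CREATE EXTENSION pg_stat_statements; in the target database.
SELECT query,
       calls,                                            -- how often the statement runs
       mean_exec_time,                                    -- average time per execution (ms)
       stddev_exec_time,                                  -- inconsistent timings need a look
       shared_blks_read + shared_blks_written AS io_blocks -- rough I/O intensity
FROM pg_stat_statements
ORDER BY mean_exec_time DESC    -- or ORDER BY calls DESC for the most frequent queries
LIMIT 10;
```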
Because it's off by default, if you don't know to look for it you will probably not know that it exists, so you need to turn it on. Secondly, the data is aggregated from the time it was turned on, or from the time it was last reset, so you don't have the ability to time-bucket the data with pg_stat_statements; it's accumulated data. And there's a small CPU overhead when pg_stat_statements is turned on, approximately four percent based on a few studies that have been published; based on best practice, the overhead is generally small enough to be negligible and the benefits far outweigh it. Some tooling that you can use to visualize and analyze this data better: pgAdmin gives you access to pg_stat_statements through a GUI and the ability to visualize that data. pgBadger doesn't use pg_stat_statements directly — it analyzes the log files — but it gives you very similar statistics, and it's a tool that will help you visualize, through graphs and charts, how your queries are performing and how your SQL is running in the database. A combination of Prometheus with Grafana — Prometheus using the Postgres exporter to export these statistics and Grafana to help you visualize them — can also be an option. And if you're willing to go towards commercial tools, some of the popular tools that we have seen used pretty reliably in production are Datadog, DBeaver or New Relic; there are others as well, this is just an example that I have put up. Now, moving on from pg_stat_statements, let's talk a little bit about the explain plan. The way the Postgres query planner works is that it does a cost-based analysis of the different ways a query can be executed in the database and picks the plan that is the least costly. Think about it this way: you need to buy something, so you research the specifications of what you want to buy, and you also research where you can get it cheaper. Let's say you're buying a television and you've already decided which television to buy; you will probably want to buy it from the cheapest source, because you don't want to spend money. That's what Postgres wants to do as well: it doesn't want to spend a whole lot of resources on executing the query, so it assigns costs to the different resources that are available to it, including CPU, memory and I/O, and based on the cost of executing a particular query it decides which plan to use. I'm going to elaborate on that a little bit and help you understand what these costs are and how to control them, but broadly speaking there are two things involved in deciding the cost of executing a query. One is configuration parameters that help the query planner determine how much cost is going to be involved for, for example, fetching an index, seeking data from memory versus seeking it from disk, or how expensive a sequential scan is going to be versus an index scan. The query planner also gathers statistics from the database in order to understand, for example, an estimate of how many rows a table has and how wide a column is for a table, and based on these statistics and configuration parameters it then calculates the cost of executing a query. The key metrics in an explain plan — I'll come to an example in just a bit to help you visualize what this all means — are cost, rows and width. Cost, of course, is calculated by the query planner;
it's not an absolute number, so don't read it as the number of milliseconds or seconds the query is going to take; it's a number that is relative in nature, and the query planner uses that relative nature to determine which cost is higher versus lower. Then there's the number of rows a query is going to return, and the width of each row it returns. These are the key metrics to watch out for in an explain plan. The key outputs that an explain plan gives you are how it's going to scan the data, how it's going to join it, how it's going to aggregate it, and how it's going to sort it; these are the four key areas. The way you run an explain plan is that you type EXPLAIN and add the query after it, and that will give you the plan for that query. If you do EXPLAIN ANALYZE it will not only give you the plan for the query, it will also actually execute the query — so if you want to find out how long it's going to take to delete all the data from your database, do not do EXPLAIN ANALYZE, please. Okay, now an example — just a random example, nothing very specific about it. There are three tables involved here: orders, order details and products, and what we want to understand is the top five selling products of 2023 and what their sales looked like. So we pick up the product name, we sum up the quantity multiplied by the unit price to get total sales, we join orders, order details and products — the product ID linking products and order details — and we group by product name; the where clause covers the whole year of 2023, we order by total sales, and we limit to five. This gives us the top five products by sales for the year. Now, if we run an EXPLAIN on this query — and the tables are currently not indexed — this is what the output is. It looks like a bit of a scary output if it's your first time looking at an explain plan; it's a little hard to decipher what's going on, there are a lot of words and numbers in there. But let's go back to what we talked about on the earlier slide: the three key metrics in an explain plan are cost, rows and width, and what you need to look for are scan, join, aggregate and sort. Looking at cost, rows and width, there's a cost label on pretty much most of the rows, and within the parentheses you've got rows and width. What this is saying is that the startup cost is zero, the total cost to execute the sequential scan is 225, 2,000 rows are going to be returned, and each row is eight bytes wide; essentially that's what it's saying. So these are the three things you need to watch out for: cost, rows and width. The reason I say you need to watch out for them is that if at any place you find the cost turning out to be too high, or the number of rows returned is higher than you expect, or the width of the rows returned is higher than you expect, then there's something wrong with the query. Let's say you expected the query to return a thousand rows and the explain plan shows that it's actually returning 10,000 rows: you're retrieving more data than you absolutely have to, which will impact performance and make the database slow, so you want to make sure that doesn't happen. This explain plan is nested, so the innermost segment is the one that is going to execute first, and then it works upwards.
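For reference, the query being described would look roughly like this; the table and column names (orders, order_details, products and so on) are assumptions for illustration, since the exact schema isn't shown here:

    EXPLAIN
    SELECT p.product_name,
           SUM(od.quantity * od.unit_price) AS total_sales
    FROM orders o
    JOIN order_details od ON od.order_id = o.order_id
    JOIN products p ON p.product_id = od.product_id
    WHERE o.order_date >= DATE '2023-01-01'
      AND o.order_date < DATE '2024-01-01'
    GROUP BY p.product_name
    ORDER BY total_sales DESC
    LIMIT 5;
    -- EXPLAIN ANALYZE on the same query would additionally run it, which is why
    -- you never want it on a statement that deletes or modifies data you care about.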
So when the startup cost of this node here is zero and it goes up to 225, that is where the hash that sits one nest above it starts off, and its cost is pretty much zero, and so on and so forth. So again: these three things — cost, rows and width — plus the way it scans the table (it's all sequential scans here), the way it joins the different tables, and the way it aggregates and sorts — these are the key things you need to watch out for. The reason you need to look at how the query planner is scanning the table is that if the columns are indexed and the explain plan shows it is doing a sequential scan, there is something wrong going on; you want to investigate that and make sure it gets fixed. So look at the explain plan, look at anything that might be wacky in there, but what you really want to focus on — again, this is very important, which is why I keep emphasizing it — are the cost, the number of rows returned, and the width of each row returned. Now, in order to help the query planner do its job properly, it's important that ANALYZE is run periodically on the tables. What ANALYZE does is update the statistics of a particular table. Let's say there's a table in your database that is updated very frequently, with a high rate of insertions, and the query planner plans queries against that table: you don't want it to assume x number of rows when there are now y number of rows, where y is significantly higher than x, because then the planning is going to go haywire. Running ANALYZE on the tables periodically helps ensure that the database statistics are kept up to date, so the planner can calculate costs accurately. And this cost that we were referring to as an arbitrary or relative number — the database configuration parameters are there for tuning it; you can adjust these parameters based on your workload and the hardware available to you, and we are going to do a deeper dive into the database parameters in a little bit, so don't worry if you don't yet understand what this means here. Now, moving on, another thing to watch out for — a very simple concept — is locks, as you program and as you write your SQL code. The simplest example I could come up with: session one starts a transaction, updates the table foo, setting values for the rows with ids 1, 2 and 3, and then commits. Session two is also updating that same table and also tries to update id 1. While session one's transaction is in progress, it has locked the row with id 1, because it has updated it, and until the commit happens it will keep an exclusive lock on that row; session two is going to wait for that lock to be released before it can execute its statement. You want to make sure that such exclusive locks are only held when they're absolutely required, and this is especially critical for long-running transactions, because a long-running transaction could hold an exclusive lock for a long time, which will prevent other processes from updating those same rows, thereby impacting the performance of your database. So, key takeaways from SQL and query optimization: monitor your queries, make sure you're on top of them, understand which your longest-running queries are and which queries run most frequently, analyze how those queries are being executed so you're aware of any anomaly — anything that looks wacky or off — and code in a way that avoids locks.
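To make the blocking scenario above concrete, here is a minimal sketch; the table foo and the values are illustrative, not from the slides:

    -- session 1
    BEGIN;
    UPDATE foo SET value = 42 WHERE id = 1;  -- takes a row-level lock on id = 1

    -- session 2, run at the same time
    UPDATE foo SET value = 99 WHERE id = 1;  -- blocks until session 1 finishes

    -- session 1
    COMMIT;  -- only now can session 2's update proceed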
Now, the next point I'd like to discuss is some performance features that Postgres offers, and mostly I want to talk about indexes here. Postgres offers different types of indexes that you can leverage depending on your use case, on how you want to query the data in the table. The default index, if you don't specify anything, is the B-tree index; it structures the data as a balanced tree and can be used for equality checks, for ranges, for greater-than and less-than comparisons, and so on. Hash indexes are focused on equality checks: essentially the index stores the hash of the key and can do a very fast lookup for an equality check. A hash index is really bad for range comparisons, which is where the B-tree index shines, so this is use-case dependent — a hash index should be used when you need fast equality access. A composite index combines multiple columns, and is used when you have queries that frequently access data where the conditional clause depends on multiple columns; if your WHERE clause has not one but two or more columns involved, having a composite index on all of those columns will help that query retrieve data faster. A partial index is not just on a column; there's a conditional expression that it evaluates, which makes the index smaller in size — it doesn't index the entire data, it only indexes the rows that satisfy the condition of the expression. A covering index — and there's a very slight difference between a covering index and a composite index — also captures multiple columns, but it's used when you don't want your queries to hit the table at all; you only want them to retrieve data from the index. The way it's defined is that you define an index on a particular column and use the INCLUDE keyword for some of the other columns you usually retrieve as well. So let's say there's column a, where the index is implemented, and as a routine you also select columns b and c: if you INCLUDE b and c in your covering index, then when you query that table using the index, Postgres can retrieve the data from the index and not from the table, thereby saving you time and giving you better performance. A BRIN index, short for block range index, essentially stores the minimum and maximum values for each range of pages rather than the entire data, so for very large tables spread across many pages it will save you a lot of disk space; it only captures the minimum and maximum values per range and can still zero in very quickly on a value that you query. Now, with indexes as well, it's not one-size-fits-all, so be careful as you use them. Indexes do take up space and they will cause bloat if you overuse them, so please don't overuse them. A few conditions where indexes are probably not recommended: if you need most of your data anyway, Postgres will end up doing a sequential scan on your table, the index is just not going to be utilized, and you're wasting disk space. If you have a very heavy write or update workload and very few reads, the index is not going to be used much, and each time you insert new data or update data in your table the index needs to be updated as well, which means you're actually slowing down your database without leveraging the index functionality.
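As a quick reference, here is roughly what each of the index types described above looks like in SQL; the table and column names are made up for illustration:

    CREATE INDEX orders_date_idx      ON orders (order_date);                          -- default B-tree
    CREATE INDEX orders_status_hash   ON orders USING hash (status);                   -- equality lookups only
    CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);             -- composite
    CREATE INDEX orders_open_idx      ON orders (order_date) WHERE status = 'open';    -- partial
    CREATE INDEX orders_cover_idx     ON orders (customer_id) INCLUDE (status, total); -- covering
    CREATE INDEX events_brin_idx      ON events USING brin (created_at);               -- BRIN for large, naturally ordered data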
So for write-heavy workloads, update-heavy workloads, and workloads with very seldom reads, indexing is not a good idea. We've already talked about data bloat. And if your table is so small that a sequential scan will fetch it very quickly anyway, there's no point in putting an index on it. A few highlights — and these are just examples, there are many such examples — of Postgres performance features that just work: you really don't need to do anything to invoke these features or make them work. Postgres is intelligent enough to, for example, plan for parallel query execution, leveraging the multiple cores available to it, if it can. It will avoid index changes using heap-only tuples: if you are changing a row in a way that does not touch an indexed column, it will use heap-only tuples (HOT) to avoid the index updates and thereby improve performance. Incremental sort: if data is semi-sorted, Postgres will not just start from scratch and sort the entire dataset; it will intelligently leverage the semi-sorted nature of the data and sort it faster. Autovacuum — I know at least yesterday there were a couple of talks focused on MVCC and autovacuum — as updates occur and dead tuples accumulate in a table, the autovacuum process will kick in and automatically clean up the dead tuples, saving you table bloat. All of these things happen automatically; all Postgres asks you to do is let it do its job, and it will do it, so you don't need to tinker with anything or invoke anything. Yes — so, yes, you can optimize them, but you don't need to tinker with them for this to kick in; even if you don't optimize them, Postgres will still execute queries in parallel with the default values. So yes, optimization, definitely — I'll come to optimization in the next section, there are lots of optimizations you can do, I completely agree. There are parameters that will help; there are parameters for autovacuum as well, for example, that can help optimize how autovacuum is done. But the point here is that you should only go into that if you know what you're doing — which of course you do — but not many people do. So, key takeaways from the performance features you can leverage: indexes are a powerful ally, but please don't overuse them, and let Postgres do its job; don't tinker with it too much, and for the most part it will very intelligently figure out the best way to execute queries. Now, the next point is architectural improvements that you could possibly make in your database, and the first aspect of it is load balancing. What load balancing does is essentially distribute queries across multiple nodes or servers, in order to make sure that one node is not over-utilized or burdened with all of the queries and all of the transactions taking place. The benefits: it prevents a single server from becoming a bottleneck, it can facilitate horizontal scaling, and — while we're working on performance — the byproduct is that the system overall becomes resilient to disasters, because now you have multiple copies of the data, and in case one of these nodes breaks down you can fail over to another node. Graphically speaking, this is roughly what it looks like: on your left over here
is a single server that is serving the application, which is writing to it and reading from it, and that same database is serving the reporting and analytics for the business. Now all of the write and read workload from the application, and from reporting and analytics, has to be handled by this one node, and the node can be overwhelmed. We had a very interesting example recently — this is back from my hometown. A little bit of context first: Pakistan, unfortunately, is one of the very few countries left in the world that is still endangered by polio, the crippling disease, and there's a massive effort ongoing in the country to vaccinate and immunize kids against it. There are campaigns that are run, and we were working with the organization that is tracking the administration of those doses to all the kids in the entire country. What they were doing is using the cell phones of all the workers going door to door all over the country, tracking their location pings, storing the lat-longs and creating heat maps in a central location to determine where the workers were actually going and what parts of the country they might have missed. So there was lots of data streaming in, and a very nice dashboard built up to create a good decision-making environment, and the entire dataset was being stored in Postgres. Now they came to us saying that when they start running analytics on the data that's coming in, the on-screen rendering of the real-time pings coming in from the field slows down significantly. I actually went into their office trying to determine what was going on. This was a team with really good engineers, good skills within the team, but one thing I found surprising was that all of these transactions, all of this processing, was being done on a single node, on a single server: lat-long pings from all over the country, from all the workers going door to door, coming into the same server, all analytics being run on the same server, and when management wanted to run some reports, the database would just refuse to ingest the incoming pings. What we recommended was: let's load balance your database. You'll have one node that is dedicated to ingesting the data, and we can set up replication to two standbys; the application that is sending those pings, whenever it needs to read the data, can pick it up from standby one, and you can run your reports on standby two. Just splitting that load helped decrease the pressure on the single server, and then everything just started working fine. So load balancing can be a very simple way to reduce some of the pressure that one node might be feeling. If your workload is both read and write, and there are different parts of the business doing different kinds of things with the same database, you might want to split the database up to serve those different parts of your business. A simple example that I came up with — actually this was a quick hack on my own laptop using pgbench: you just run three nodes on the same laptop versus one node on
the same laptop, for 60 seconds, running a select-only pgbench run, and even with this very, very simplistic measure, on the same hardware, we could see performance gains: the TPS was about 371 for a single node and it went up to 485 for three nodes. Again, the hardware is the same — the resources for the three nodes combined and for the one node were the same — and even then Postgres was able to perform about 30 percent better when the load was distributed. So this is one easy way to improve performance: load balance it. Another aspect to look at is partitioning. What partitioning does is divide large tables into smaller, more manageable pieces, and the benefits you get are that queries perform better, maintenance is easier, and indexes are smaller and also perform better. An example, visually speaking: let's say you've got an application that writes time-series data and you have data for the entire year. One way to partition it is by quarter, which is what is being done here, so think of these four partitions as the first, second, third and fourth quarter. Now, if you run a very simple query — select star from the table where the month is August — without partitions the application queries this one big table and needs to scan it to pick up the data for August, and this can of course be a very large table. Whereas if you use partitions, when you run that same query you can zero in on the third quarter, and you are now dealing with a much smaller table, a much smaller partition, instead of the larger table, and you will significantly improve the performance of your database and of your queries. The key takeaway from architectural improvements — well, it's just one takeaway — is don't overload a single node; if a node is being overloaded, split it up, scale it horizontally, split up your database. The last point is parameter tuning. We have hinted at this earlier in the presentation: there are configuration parameters that you have access to and can change in the database, though for the most part the defaults tend to be good enough. Very broadly speaking, you can categorize the available configuration parameters into allocation — how much memory or other resources get allocated to the database — and defining costs to help the query planner. Three examples of easily tuned allocation parameters: the first is shared_buffers, which defines the amount of cache that frequently accessed data can use in the system. The default is set to 128 MB; the usual recommendation is to set it somewhere between 25 and 40 percent of available system memory — you do want other processes to be able to use memory as well — but shared_buffers is going to define how much cache memory the database uses. wal_buffers — write-ahead logs; this is something other people have talked about, so I won't go into a whole lot of detail about what they are — but essentially, before data is written to disk it sits in memory, in buffers, and the amount of WAL data the database holds in memory before writing to disk is defined by wal_buffers. The default is about 3 percent of shared_buffers, and in high-concurrency workloads raising the value up to 16 megabytes can improve performance.
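A sketch of how these allocation parameters are typically set — the values here are placeholders, not recommendations, and both of these particular settings only take effect after a restart:

    ALTER SYSTEM SET shared_buffers = '8GB';   -- e.g. roughly 25-40% of RAM on a dedicated box
    ALTER SYSTEM SET wal_buffers = '16MB';     -- default is derived from shared_buffers
    -- both are written to postgresql.auto.conf and picked up at the next restart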
work_mem is the memory that is available for a query operation — the memory allocated to a single atomic operation within a query, such as a sort or a hash. The default is 4 megabytes, and with very I/O-intensive activity an increase in work_mem can probably help, because more of that work can be done in memory; but you have to be careful, because setting it too high can flood the memory — every concurrent operation that is taking place can be allocated this amount. So 4 megabytes is the default, and you can set it a little higher based on the amount of resources available in your system. Then, three examples of parameters that define costs for the query planner. The first one is cpu_tuple_cost, and essentially it defines the cost of processing a single row of data, including the WHERE clauses and joins involved. The default is 0.01, and a lower value indicates that the cost of processing a row is very low, which encourages the query planner to process more rows; it's helpful for I/O-bound setups where the bottleneck is actually disk I/O, not CPU, so the CPU is used more when cpu_tuple_cost is low. When cpu_tuple_cost is high, the query planner tries to process fewer rows in memory and tends to lean more on I/O. So if your setup is CPU-bound, if CPU is the resource you're lacking, you can set this higher — raise the cost — so the planner tends to process fewer rows. random_page_cost defines the cost of accessing a random page in the database. The default is 4; a lower value implies that random access is cheap, which is good for SSDs, and it encourages index scans. If you have a magnetic drive, a spinning disk, a higher value is probably more suitable, because it is more difficult to access random values on a magnetic disk, and it thereby encourages sequential scans, which are easier on a spinning disk. effective_cache_size is not exactly a cost parameter, but it's worth mentioning because it's what the query planner expects in terms of the cache size available to it, combining shared_buffers and the OS cache together. The default value is 4 GB, but the point to note here is that this is not an allocation — it's an indicator to the query planner of how much cache it can expect, so that it can plan its queries accordingly. Higher values imply that more data can sit in cache, which encourages index scans because the data can be accessed quickly; lower values imply that less data is in cache, so you might need to go to disk, which encourages sequential scans. Some takeaways from parameter tuning: you want to tweak your parameters based on not only your hardware but also your workload, so you want to know what resources are available to your database and also what to expect in terms of the type of workload your database is going to be processing. And you will need to experiment a little bit — especially with the cost parameters — before you can fully optimize your database, which also makes parameter setting a very good candidate for AI-based tuning.
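And a sketch of the planner-facing parameters just discussed, again with illustrative values that you would want to experiment with against your own workload:

    ALTER SYSTEM SET work_mem = '16MB';              -- per sort/hash step, so multiply by concurrency
    ALTER SYSTEM SET random_page_cost = 1.1;         -- lower on SSDs; leave it near 4 for spinning disks
    ALTER SYSTEM SET cpu_tuple_cost = 0.01;          -- the default; raise it if CPU is your bottleneck
    ALTER SYSTEM SET effective_cache_size = '12GB';  -- a hint (shared_buffers + OS cache), not an allocation
    SELECT pg_reload_conf();                         -- these take effect without a restart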
There are at least a couple of startups focused on AI-based tuning for Postgres, and I find those projects very, very interesting, because this is actually something that, as consultants, we spend a lot of time on, and I think it can be automated, so it's good that a few people are working on it. Pretty much moving to the conclusion: for database performance there is no magic button — don't look for it. It involves a lot of variables, and some of them are fairly easy to tackle; you don't need very deep database expertise to be able to tackle them. Tweak those areas, leverage your database to the fullest, and only once the database has been leveraged to the fullest should you think about using your credit card as you scale. That brings me to the end. The QR code is my LinkedIn profile — connect with me on LinkedIn, I'll be happy to connect and talk about Postgres, and in case you are running into problems you can just call us. And of course, keep calm. Any questions? Question: to enhance my understanding of the difference between a composite and a covering index — you said the covering index is useful because you've added these columns and now you can retrieve those columns from the index rather than the table. In a composite index, can you also do that? If I have three columns in my composite index and those are the only three I'm querying for, do I get those out of the index in that case too? You can see that the composite index does everything that a covering index does and a little more. With a composite index, each of the columns is sorted in order; a covering index will only sort by one column and simply include the other columns. So yes, you can use a composite index for that — sorry, I don't quite understand that point — okay, right, yeah, exactly — so basically a composite index allows you to filter by each of the columns, and a covering index filters by one column while the other data just happens to be in the index as well. What is the overhead difference? If I'm only worried about two columns, is it worth it to use a covering index, or does that really only come into play if it's 10 columns out of a 100-column table? I think it will very largely depend on your table, on the size of your data and how it's structured; you will need to experiment with it a little bit to understand how much benefit you're getting. There is a benefit, and the benefit comes from the fact that the included columns are not sorted, not indexed, in the covering index — and whether that's worth it or not depends on the use case. Okay, any more questions? You said that when you need to access all the data from a table, an index is more or less useless. My understanding was — and it might be wrong — that the index is used to look up the position in the table to get the data. Is that different in Postgres? So, I don't recall the exact percentage, but if you run a query against a table that seeks to retrieve more than a certain percentage of the total size of the table, the Postgres query planner will actually resort to a sequential scan even if the column is indexed, because that's cheaper to execute — you're retrieving that much data anyway. And if your use case is such that you retrieve a lot of data, or a major portion of the data, from the
table anyway, all the time, then by indexing that column you're just wasting space; you're not really using it, and the query planner will go for a sequential scan in that case anyway, so you're just wasting space. And can you see that with the cost measurement you showed at the beginning — can you see it with that? Yes — what you will observe — actually, I don't have the slides or the database setup for it, but it's a very quick and easy experiment that you can run: create a table, any table, put a lot of data in it, indexed by one column. Run a query that retrieves, let's say, two percent of the data and run an explain plan on it, then run another query with just the where clause changed so that it retrieves, let's say, 90 percent of the data, and run the explain plan on that. In the first instance the query planner will show you that it looks up the index, it does the index scan; in the second instance, even though the column is indexed, it will run a sequential scan. It will show you right there. Okay, thank you. And are the slides maybe available online? I think so, yes — I think once all of this is wrapped up, the SCaLE organizers will make them available on the website. Okay, thank you. And the presentation is also available on YouTube, on SCaLE's YouTube channel. Any more questions? We have a question — this is sort of a general question: let's say you have Postgres installed from Ubuntu packages, right out of the box — can you speak a little louder, please? — a general question: let's say you have Postgres installed from Ubuntu packages, right out of the box, on a production server, and it's a server full of SSDs for storage. What are you reaching for first, as far as general performance improvements, to make that perform better right out of the box? So, right out of the box, I think some of the things you can do are to understand what hardware resources are available: what memory you want to allocate for the database to use. Right out of the box you want to understand how many cores you can use for parallel workers, and how much memory you want to allocate to shared buffers and to WAL buffers. Let's say you've got a ton of memory available and yet you're allocating only, I don't know, 12 gigabytes to shared buffers — that would be an under-utilization of resources. So those are some of the things you can do right off the bat. Cool, thank you. Okay, any more questions? Okay, one more — got another one: partitioning — is that something you can only use when you have a set range that the data you're partitioning on will fall in, like you said with a year divided into quarters, or is there a way to set up a table to be partitioned even as I am continually writing new data to it? I have a log table, say, and I want to create a 2025 Q1 partition when it gets to be that time — is that a thing that works? Yes, it does. Essentially what you do is define a rule on how the data is partitioned, and you can define that rule so the data is routed to the right partition as you insert it as well, so it can be a dynamic rule — it can be independent of the year in this particular example: all months that are January through March will go to Q1, independent of the year.
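As a sketch of what that can look like with declarative partitioning — the table and column names are hypothetical — here is a range-partitioned log table where you simply add the next partition before you need it; a year-independent quarterly scheme, as described in the answer, could instead partition on an expression over the month:

    CREATE TABLE readings (
        recorded_at timestamptz NOT NULL,
        device_id   int,
        payload     jsonb
    ) PARTITION BY RANGE (recorded_at);

    CREATE TABLE readings_2024_q4 PARTITION OF readings
        FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');

    -- when the time comes, add the next quarter; new rows are routed automatically
    CREATE TABLE readings_2025_q1 PARTITION OF readings
        FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');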
So as you write data, it will go to the right partition. Thank you. Okay, I'm afraid we're out of time, but thank you. Okay, thank you very much. Oh, I did that yesterday — oh, okay, thank you. Test, test — is it working now? Right, sound test, sound test; if you are sitting in the back, can you hear me on the speakers right now? We'll give it 30 more seconds and get started pretty soon. How's everyone doing? Home stretch of the Postgres track, right? All right, welcome everyone to my talk, titled Testing Your Postgres Backups: A Practical Guide. My name is Nick Mayer, and let's get started. Welcome to Pasadena — it's a pretty fun place. It's been about six years or so since I've lived here full time, but I do enjoy coming back here quite a bit; I can't point to any specific reason why, I'd say it's probably mostly just nostalgia. Speaking of which, nostalgia has this way of being a contagious social phenomenon, especially nostalgia around the Roman Empire, whether it's on social media or similar. Alongside it there tends to go this narrative that we just don't build things like we used to — this notion that there's ancient knowledge that we have lost, that we have these ancient structures built with Roman concrete, Roman roads, and yet somehow engineers nowadays can't build things that last more than a few decades, while this stuff has lasted several millennia. So what's up with that? And perhaps some of you just walked in, or maybe you tuned out for a moment, and are wondering: okay, wait, what is going on here, who is this person talking in front of me right now, did he introduce himself, where were his title slides, I thought this was supposed to be a talk about Postgres backups. If you're in that boat, don't worry — I promise you, you're in the right talk, and we'll get to all of that shortly. I just wanted to start out with this analogy in order to explain to you all, up front, what this talk is not going to be. In this talk I will not tell you, or give you the impression, that I know everything there is to know about backups and that I'm the end-all, be-all expert on backups, because that's not true. I am not going to tell you that my employer or I have formulated a perfect backup testing strategy that will leave you fully confident and risk-free, and I'm certainly not going to tell you that we've never made mistakes. The reason I bring up this analogy of the Roman Empire in the first place is that if we want to replicate success, we can't just look at examples of people who have done something and found success: we don't know whether that success was just due to luck, and we don't know what they tried that didn't work. A common theme I've found — certainly in my experience with Postgres backups, but I think it applies pretty generally — is that a lot of the time, what we learn the most from is failure, from things that don't work out, and I'm hoping to give you a taste of both sides of that coin from my personal experience. Speaking of which, some introductions are in order. Who do I work for? academia.edu. Who are we? We are an organization whose mission is to ensure that every paper ever
written is available on the internet available there to read for free online uh more broadly what our goal is is to accelerate the world's research and you know to go into some of our successes if you will or some stats at least pointing towards our progress towards those goals so far we have about 47 million papers on our site that have been uploaded depending on how you do the rounding uh let's say that we think there's about a hundred million papers that have ever been written um in the academic world it's constantly moving target so maybe we're about like 50 percent of the way there um on the on the theme of accelerating the world's research there's a common story in academic publishing where someone will publish a paper and it gets read maybe by three people total give or take one read by reviewer number one another read by reviewer number two you know you probably see where this is going reviewer number three is the final person to ever glance at that paper and that's just the way it is and at least let me put it this way there's a lot of research out there that deserves better than that and from the perspective of the world you know if this is impactful research that can make a difference in people's lives and in our understanding of the world all the world deserves better than that and you know to that point uh we we have been addressing that problem to the tune of 20 million paper recommendations per day making sure it actually gets read by the right people okay but this is a postgres talk i should probably give you a little bit of context on what we're working with particularly this is a talk about backups maybe i should answer the question up front what you know what is our scale what size of data are we backing up uh because this probably has some implications for you know if you are at a different scale than us whether that's three orders of magnitude larger or three orders of magnitude smaller maybe you'll need to make some adjustments we have about a hundred terabytes of data if i round super aggressively distributed across 15 independent different clusters entirely on aws and no matter how you slice it it's about half and half between aurora rds and our older self-managed ec2 instance clusters um you know one for a variety of reasons ha never-ending need for high uh you know high read throughput in our workloads we have tons of read-only replicas um and you know getting to like the team size since i think this is pretty relevant too this is a practical guide it's about balancing all the things you want to do with all the things you actually can achieve in practice we have maybe about 50 engineers total across the entire organization divide that in 10 for people more on the infrastructure side of things speaking of which who am i uh hi my name is nick mire um i have my hub link there we have to go even earlier than the romans to the greeks for aristocrates so imagine you take aristotle and socrates and mash them together into aristocrates sounds a little uh pretentious too like an aristocrat or something um but yeah uh who am i uh at academia i'm currently the team lead of our platform engineering team my areas of focus uh on that team have included uh the developer experience uh the kind of the interface between our application layer where the rest of the develop engineering organization works and our infrastructure layer um and because you know as i mentioned before we're not like the largest uh engineering team in the world uh there are some interesting staffing tradeoffs 
which result in the data layer, and Postgres in particular, being part of my daily work. And in particular, speaking of Postgres and Postgres backups, allow me to introduce you to academia's old Postgres backup solution. You may have heard the advice at some point in your life: never roll your own backup system. Allow me to introduce you to a Ruby script. All right, it's not as bad as it sounds — it was a Ruby script wrapping pg_basebackup, if you've heard of that tool in Postgres. But even so, I would say there is more to worry about than just the absolute lowest level; it's still, I'd say, not the greatest idea. One thing that is really nice about it is that it was a great way to learn about backups and all the things that could go wrong with them. I learned a lot having to maintain that script and, ultimately, replacing it with pgBackRest — if you've heard of that tool, it is much nicer in a lot of ways than what we had. So: a great learning experience, but otherwise a bad idea, and I would not recommend it. A little bit on the note of this being a practical guide: as an engineering organization that has made mistakes, and that doesn't have the staffing to do everything we'd want to do, my goal is to cut out a lot of the noise, because a lot of the time what is just as important as what we choose to do is what we choose not to do, because it's not worth it or it just doesn't pass the trade-off muster. But I don't want to give you the impression that we've only ever made mistakes. I do think that one reason why we're still here today, with our data more or less intact as it was intended to be, is that we stumbled, almost inadvertently, onto a pretty solid backup testing strategy. I don't think we were particularly intentional about how we came about it — we ended up stumbling upon it mostly for business reasons — and I'd love to share it with you today. So, to sum it up in short: although I'm not going to be able to teach you, in 45 minutes give or take, all the things I learned from having to maintain that Ruby script and fix things when they were broken — experience may be the best teacher — hopefully this talk is going to be a slightly more cost-effective teacher for all of you, so that you don't need to go through all those stressful moments yourselves firsthand. Okay, I think it would be useful to have a little bit of structure for this talk, just so I don't go on more tangents about Roman Empire history or similar things. I'm going to flip the order a little bit of what I talk about here, and actually start with the why: why do we have backups? That sounds like a question with a pretty obvious answer, but if we were to formalize it — what are the problems we're actually trying to address, more rigorously, with backups? After that I'm going to get into the how, starting with what this strategy is that I've alluded to — how do we test them — but also how do we set measurable goals that can be quantified around backups, and how do we monitor them. Part one of that is getting enough context on how backups work so we know what to monitor, and then how we actually do that. So let
us get started with this thought experiment of what could go wrong so let's say we have let's call it a high availability setup of some sort um you know we don't have all the information here but just trust me for now we have you know a writer node uh we have some read-only replicas following it uh we've even thought a little bit about eliminating some obvious single points of failure that there might be in a setup like this such as for instance uh you know to use like amazon cloud regions you split up the regions so that they're not all in one region um you know side note we don't actually do this this was a little on the expensive side for us with data transfer fees and such but you know let's like make the let's really steel man this example of this several node setup where suppose we don't have any backups yet um okay so maybe we think we've addressed like what if all nodes go down we don't think that's likely because the world has bigger problems if us west two and us east one are both down concurrently uh you know a question we might ask is what if some nodes go down does the application know how to handle that are there errors do things work have we tested it have we thought about it um but you know maybe like a bigger point of concern i would bring up with this is our good old friend uh drop table users or delete from users uh what if that gets run in the wrong environment okay so maybe some of you were thinking like okay this is this is a problem that other people would face at other organizations you know we have stricter controls in place uh then we we have strict enough controls in place that like there's just no way a dev would think they're connected in one environment but actually be connected in another but you know the question i'd pose to you is do you do you really trust your automation that much is it like you know seriously what what is what is the control in place there like at a higher conceptual level like is it is there someplace somewhere where a string might just get mixed up one place and this can get routed to the wrong location so you know i i've seen some pretty crazy stuff that i'm not going to get into uh but you know i think that like we at the very least if we care enough about our data to have backups then we should care enough about our data to want it to be you know we we should we should want some safeguards or some some options or recourse if this were to happen and the problem with this setup is that uh this this dml just replicates immediately to these replicas under our current setup uh maybe some of you are thinking okay but like they're we may or may not be aware that like there's a way to have delayed replicas with post-dress maybe we can put a six hour delay on just one just in case i feel like that's kind of fallen out of favor in the modern day you know because like what if you do what if we don't notice for one millisecond after that amount of time you know i've definitely seen cases where there's been data loss on systems other than post-dress let's just leave it at that um that we didn't notice for a day and just had to recalculate um okay so all right all right you've convinced me or i've convinced myself i've convinced you hopefully we should have backups i'm not going to define exactly what that means yet or go into the details of how that works but there's some notion how let's just say we're going to be sending some sort of copy of what's on post-dress uh could be from the writer maybe from something else to s3 just to keep the 
example within amazon cloud if you you know if that doesn't apply to your organization just kind of translate in your head whether that's a hard drive on some other computer somewhere else or an equivalent for some other cloud um but you know we're still left with this question what do we actually do if you know let's say a disaster does happen we you know we're supposedly prepared we have a backup a node goes down uh what what actually happens at this point and that brings me to you know my first major point i'd like to make here which is that we don't care about backups we do not because we care about restores not backups um with that i'm going to let that simmer a bit as i take a sip of water so what what is our actual goal like what is our purpose with taking backups because if they're right only if they're just this right only thing that we never use then we're just wasting money on them probably um i'm going to say that what we want with backups is there is some point in time where the data layer was in some state and we want the capability to be able to restore our postgres layer to whatever that state was um you know and we furthermore we want it to be sufficiently decorrelated with other with whatever is running postgres that if postgres were to completely disappear we would still expect the backup to exist we'd still expect it to be usable um so you know maybe a question we have is what exactly are we restoring uh we can have different goals here for instance we might want the capability to restore just anything that was ever written or you know to the greatest descent possible targeting that as our north star so to speak or maybe there's some other use cases for instance we need to audit something that happened a week ago uh we need to for some compliance reason look at something look at like some row that was present in some table a month ago but has since been deleted um so maybe like in the past 30 to 60 days we have reasons to look at it other reasons might be like corruption like there's some problem with our data now but we're pretty confident it was okay in the past it's always good to like have heaven as an option the ability to at least like peek at it um another thing i'd mention is restores need to be fast enough to be useful uh i'm not going to define exactly what that means in clock time because that is obviously a pretty that's that's pretty dependent on like what your workload is what your business requirements are what's you know what your data size is uh i would not hold a 15 terabyte database to the same standard as you know a 20 gigabyte database but generally speaking like you know with most of our workloads like bringing it back to kind of like what do we think how do we think of this at academia we want to be able to replace like a node that fails within hours not days all right so we went over what might go wrong if we don't have any backups um let's say we have backups though and let's say that we are appropriately concerned with restores um allow me to introduce you to the concept of schrodinger's backup which we may be familiar with um or maybe we're familiar with the cat uh this isn't schrodinger's cat this is schrodinger's backup uh schrodinger's backup a little bit similarly to the cat states that the condition of any backup is unknown until we restore is attempted now so okay i i don't i don't know if like this forum spotlight on it series number 2 12 in 2013 i don't know if that was actually the first uh reference of this i'm i'm a little bit 
skeptical, it seems, but it was the earliest reference I was able to find on the internet. Furthermore, beyond that, maybe some of us are thinking that's a little bit too strict: is it really not possible to just kind of glance at it? Let's say it's in S3 — there's a file viewer you can use in your browser, you can take a look at it by eye. What are some specific reasons why that might not be enough, whether or not we implement other monitoring? So let me introduce you to some actual backup failures, if you will, that I have seen in prod. All right, the easy case first: backups just weren't happening a lot of the time. The way backups are taken in Postgres, there will be a component of it, at least, that looks kind of like cron — whether it's a cron job or something that's isomorphic to that — and it's probably a problem we can all appreciate: whatever is supposed to be scheduling some operation just isn't actually running it, for some reason or another. Maybe that's easy enough to check: you just look that a file is actually in the destination, and maybe it's there, maybe it's not. Okay, but leveling up a little bit to, let's say, a stage-two enemy, a stage-two boss fight: successful backups. We have metadata, or some exit status, or some indication of success telling us that this backup was taken — self-reported by the system that's responsible for taking it, of course — but actually it's just an empty file. What now? What are some things that might cause this? Well, in this particular instance — if you're familiar with the concept of pipefail — the Ruby script I alluded to earlier, which our backup solution effectively was, piped the output of pg_basebackup to standard out, piped that into gzip, then piped that to S3; that's not exactly how it worked, but conceptually that was pretty much it. And if you don't remember to check the exit status of every single step in that chain, you end up with the last step receiving nothing because something earlier has failed; the last step thinks that it succeeded — it has successfully compressed the empty string and successfully sent it to its destination — and that's how you get an empty file that is supposedly a complete backup. And then the level-three boss fight: a backup that we definitely know is not an empty file — there's stuff there, it looks pretty good — but if you attempt to restore from it, Postgres says it's starting; three hours later, it still says it's starting. I'm not going to bore you with the details — it never finished starting; I'll touch a little bit on what might cause that later. So basically, the reason why I invoke the concept of Schrödinger's backup is that if we really are worried about all three of these — and I've seen all of these happen — the only way you catch every one of them is if you actually attempt a restore of a particular backup and then see it work. That is the only way you can defeat that level-three boss fight, so to speak. Okay, so some additional things that I am not going to talk about today, because I only have 45 minutes — maybe half of that at this point: ransomware, data corruption, insider threats. A common theme among these concepts is
that i mentioned that we want to restore the data layer to what it once was uh this is kind of challenging the assumption that that's even something that we'd want to do like maybe maybe that state that has been persisted into our backups is actually bad in some way and we need some additional layers of defense beyond what we have with backups um i think of backups as i think of backups and like backup testing and restore testing i think of this as a solid foundation on which we can build later to a more thorough data security posture if you will but um you could have a full talk on each of these all right so how do we test restores um we're perhaps familiar with the swiss cheese model but if we're not just a or you know just to give a brief recap the idea is um in there we have these red arrows that are representing something bad happening we have these layers of swiss cheese sometimes the arrows pass through a hole in the cheese sometimes they get blocked think of every layer of swiss cheese here as some defensive measure we've put in place to prevent something bad from happening each individual control that we've put in place is imperfect some things we just know we'll get through but the idea is if we stack enough layers of cheese in front of the adversary hopefully if things are sufficiently decorrelated between layers hopefully nothing will actually get through every single one of them it just takes one to block it that's the idea all right so what is part one of this strategy for testing your restores um so motivated by this business need uh at academia where we need a lot of read capacity uh we churned through a lot of replicas as a result of that is every time you need a new replica you should use your backups uh so you know what this looks like is you restore from s3 and then importantly you know connect it to connect it to your cluster through streaming replication step two once that restore finishes um and then one nice thing about like this part of the strategy is when your backups break uh for one thing you'll notice at least as frequently as you need to bring up a new replica and secondly fixing this you know fixing the situation is actually going to be a priority it's not just going to end up on a jira backlog somewhere as a nice to fix um it'll actually be blocking you know a business need and then okay so part two of the strategy to layer on an additional layer of swiss cheese in our swiss cheese model um something that often comes up for us is we need prod realistic data when testing certain operations that might be an upgrade uh it might be something as simple as just like a big batch of you know data manipulation that someone needs to do uh we often have a need to have benchmarks for that how long is it going to take does it break things uh we need to qa it and in order to get prod realistic data in that staging environment uh we've needed to create tooling to bring up a copy of prod in staging and what i'm saying here is uh you should also use your backups for that uh this is similarly you know maybe you promote it right away so it might not stay a replica but effectively you know falls under that case one you need you need a new copy of some database you restore from your backups and the additional like marginal benefit this is adding on top of that previous point is that you're really confirming that you can restore from nothing you're confirming that there isn't some sort of pollution of data that is leaking in from that point where you connect it to the writer 
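None of this is from the talk itself, but as a sketch of the kind of quick sanity checks one might run on a node that has just been restored from a backup and attached as a streaming replica:

    SELECT pg_is_in_recovery();                                    -- true on a streaming replica
    SELECT status, sender_host FROM pg_stat_wal_receiver;          -- 'streaming' once it's connected to the writer
    SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;  -- rough measure of how far behind it is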
There are some guides online that are great when you're getting started but kind of bypass this: you just take a base backup live from some node that's running and then connect things up. And to be honest, we have that code path, and it's a fear of mine that we'll accidentally revert to it and not notice that our backups are broken. So this part two really ensures that, if you at least do it sometimes, you're confirming you haven't flipped the boolean in that direction, so to speak.

Everything else on top of this I'm going to call an optimization, another layer of Swiss cheese. That's not to say optimizations aren't important; I just think these are the two most important things to do for testing your backups, or restores rather, if you're not doing it right now. So maybe the question is: how often? How about at least once in the lifetime of your database? That's certainly a good start. To be a little more serious, I'll give you the bad news: I don't think there is a specific threshold below which you're in a bad place and above which you're all good to go. Even if you're testing every single backup, how do you know something hasn't gone wrong with one of them? Technically you'd need to be continually testing restores every millisecond to be confident. My rough rule of thumb is: if the integrity of your backups is keeping you up at night, find wherever you are on the left-hand side of this chart and level up at least by one to the right. If you've never done this before, now is the time to do it once; if you've only done it once, maybe now is the time to do it on some periodic interval, like once a year, once a quarter, once a month, and so on. The more often you test it, the more confidence you can have in your backups. And although Schrödinger's backup is technically always going to weigh on the back of our minds, because we don't really know whether a backup works if we never attempt to restore from it, the more frequently you test these things, the more confident you can be that they work in general, even if you don't literally test every one. One last thing before moving on: I've seen this strategy deliver value even at a frequency of roughly once a year, or once every couple of years. The most important thing is just that you do it at least once; you will still catch problems, and you will still fix them.

A side note: I mentioned earlier that we're 50 percent on Aurora and 50 percent on EC2 instances. Just to emphasize the theme that one size does not fit all with the frequency: honestly, how often you test this probably depends on your business needs and on your budget for provisioning new instances. We generally trust Amazon to know what they're doing with backups; it's managed for us, provided that we've configured it correctly. So we go through this procedure much less frequently with managed databases than we do with our self-managed databases on EC2 instances.
In a Bayesian sense, our prior is simply that we trust our own stuff a little less than these managed databases.

OK, so moving further along into the how: what goals should we set, and how do we quantify them? Are all backups created equal? No. But how do we weigh them against each other, how do we say one backup is better than another, how do we put that into numbers? Allow me to introduce the concepts of the recovery point objective and the recovery time objective. The recovery point objective answers the question: how much data loss is tolerable? We put a number on that. The recovery time objective answers: after a disaster happens, how long until we're back, and what duration is tolerable? I think these are best illustrated graphically, with timelines. Let's say we have a recovery point objective of four hours of data loss; that's what's considered acceptable. Suppose we decide that means we should be taking backups every four hours, and assume for the sake of this that they act effectively as snapshots of one point in time. We can see a problem with this pretty much right away: say there's a disaster right now, an outage starting at 2 a.m., and our last backup was at 10 p.m. We have four hours of data loss, so we're just barely achieving that objective, and we also have an outage ongoing until this gets fixed. Illustrated on this graph, this is us just barely achieving our recovery point objective of four hours; yay us. The recovery time objective, illustrated here, is the end-to-end time starting at the moment a disaster happens and we need to restore, and ending when everything is back to normal, albeit with some data loss up to what we defined as acceptable in the recovery point objective. Suppose, for the sake of argument, we're also just barely meeting a recovery time objective of four hours; that's what this would look like.

Point-in-time recovery is that idea I mentioned earlier about auditing something that happened in the past: maybe there's some specific point we'd like to restore to, whether it's Friday, Sunday, or some point in between. Is that even possible? I'll touch on that a little later, but that's what point-in-time recovery is: which points in time, between the backups we take, can we restore to?

Some specific numbers that we target: take a hypothetical database that's roughly 15 terabytes large, which may or may not be modeled after an actual one that we have. For the recovery point objective, we want to be able to restore everything with an allowance of several seconds to several minutes of loss. I'll touch later on how that's even possible; do you have to take backups every single second? The answer is no. The recovery time objective is about six hours or so to restore a copy of the data, multiplied by three if we actually need to do a cold restore, with point-in-time recovery as an option for one month, and continuous points in between.
OK, so maybe I should touch a little on how backups work, since I've gone slightly out of order here, and there are some concepts that should be introduced for that recovery point objective to make more sense. A word of warning before we begin: this is not intended as a guide on how to roll your own backup system. As I mentioned earlier, you could say we were kind of rolling our own; we were wrapping trusted tooling, at least in some contexts, with Ruby scripts around pg_basebackup. But we're not going to go deep enough to replicate all the other tools you may have heard of, such as pgBackRest, Barman, WAL-G, and so on. Our goal here is to know enough that we know what to monitor, how to test our restores, and how to monitor those restores.

So, the difference between the physical realm and the logical realm; this applies not just to backups but at some other levels as well. Let's define "logical", as far as the data is concerned, as the reality seen by SQL clients connected to Postgres and running SQL commands: you do a SELECT from users with some predicate, or you do DML, modifying data, inserting rows, changing them, and you're avoiding the special columns we may have seen in some other talks, like ctid or xmin and xmax; you're just looking at the columns you've actually defined, holding your data. In addition, tools like pg_dump live here: pg_dump can output a SQL file that, at a logical level, can recreate your current database if you load it into something blank. In contrast, the physical realm of Postgres is what Postgres sees when it's actually looking at disk to do what Postgres does, which is effectively mapping our data structures, our tables, onto what gets persisted in durable storage. That consists of the write-ahead log in conjunction with the actual data files, the ones and zeros in PGDATA's base directory, which I'm not going to go into at all; instead I'll give an overview of how the WAL works and how it fits into backups.

A quick recap, since I suspect some people are familiar with WAL and some aren't. The effective story of the write-ahead log in Postgres is that every time you write to Postgres through a query, Postgres actually writes twice (maybe more than twice, but let's not get too far into the low-level details). Say we're inserting one row into a table. Postgres is going to do two things, in some order: it persists the change into its shared memory, and it persists it on disk into this append-only file called the write-ahead log, which happens very quickly precisely because it's append-only. Then, and only then, once that has happened, does it return to the client saying: OK, I committed that transaction, you are safe to assume the data has actually been persisted.
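If you want to look at that append-only log for yourself, the segment files live under pg_wal and the pg_waldump utility will decode one; this is just a peek, and the segment file name below is an example, not one from any real cluster.

    # WAL segments are fixed-size files with hexadecimal names under pg_wal.
    ls "$PGDATA/pg_wal" | head

    # Decode one segment to see the individual records (inserts, commits, ...).
    pg_waldump "$PGDATA/pg_wal/000000010000000000000042" | head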
A note on checkpointing: checkpointing is the process by which Postgres eventually makes sure that what is in its memory gets flushed, in some consistent state, onto durable storage. In between checkpoints (the last one taken however long ago, ideally not too long ago, but for performance also not too close to the present) there is this string of append-only write-ahead log that gives us the instructions for recreating a consistent state of the current moment, even if Postgres were to crash and need to restart from a dirtier state than we'd like. So for any physical backup system that involves taking physical snapshots of whatever is on disk, the write-ahead log is not only essential for filling the gaps between the backups you take; at a lower level, it is an essential component of taking any physical backup at all.

There is a parameter called archive_command that we may or may not be familiar with. Basically, whenever Postgres is done with a WAL file, it calls this archive_command, which we can configure to send the file to S3. If you're using tooling like Barman, WAL-G, or pgBackRest, they'll have configuration guides on exactly how to set this up.
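As a concrete illustration, and not the speaker's actual configuration, this is roughly what the pgBackRest flavour of that looks like in postgresql.conf; the stanza name is made up.

    # postgresql.conf
    archive_mode = on
    # Postgres runs this command for every completed WAL segment; %p is the
    # path to the segment file. pgBackRest then pushes it to the configured
    # repository (S3, another disk, etc.).
    archive_command = 'pgbackrest --stanza=main archive-push %p'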
So if we go back to that graph from earlier on meeting our recovery point objective: taking backups every four hours is just barely enough to sometimes meet it, but with the WAL we shrink that data-loss window to much, much less than what we had without it. And if we think back to point-in-time recovery, this is how we fill in the gaps between the backups we take.

Comparing physical and logical: I'm not going to get into the debate about whether pg_dump is technically a backup tool, or technically not a backup tool, or angrily not a backup tool; I feel it's safer to stay away from that. But even if it were a backup tool, it's safe to say it's not a very good one, at least measured quantitatively along the lines of the recovery point objective and the recovery time objective. Compared to logical backups, physical backups let you achieve a better recovery point objective, and you also get continuous point-in-time recovery: you can restore not just to every two days on the dot for the past month, but to all the times in between as well. In addition, as a performance consideration, restores tend to be a lot faster with physical backups than with logical ones, so your recovery time objective is a lot better too.

That brings us to the concept of super-physical backups; maybe there's a proper term for this that someone can shout at me at the end of the talk. I basically mean everything below the level of what Postgres even sees: file system or block device snapshots, whether that's ZFS, LVM, or EBS, Elastic Block Store on Amazon's cloud, things that can take consistent snapshots of whatever they have at some moment in time. That's certainly a possibility for you to use. Some advantages: if you use MySQL as well, or a collection of different databases, and you want consistency in your backup solution across all of them, super-physical is probably an option for you. The main thing I'd point out, though, is that plain physical backups can be a little less fragile. For instance, if you've heard of the CREATE TABLESPACE command that lets you put some tables of one database onto another physical disk, you can instantly break your backup solution if you're not careful with something like that while depending on super-physical backups. In addition, there is just better Postgres-specific tooling for physical backups, and this is a very hard thing to roll yourself. For those reasons the focus of my talk is on physical backups; super-physical is an option as well, I'm just not going to touch on it much.

Monitoring. Good monitoring is loud when it needs to be: something is broken, a human needs to intervene and figure out how to fix it. But great monitoring is quiet the rest of the time. Yesterday, at a talk, some quotes from Brendan Gregg were mentioned, and I'm going to roughly paraphrase one of my personal favorites: a good monitoring or debugging methodology should not just identify when a system is broken, it also needs to be able to exonerate a healthy system. By that I mean it shouldn't have a high false-positive rate; it should provide more signal than noise.

On that theme, what are some ways to notice when restores are failing in the first place? Because this is what we actually care about: this is how we prove to ourselves that we don't have a Schrödinger's backup but a real, macroscopic backup we can use. Should we have alerts? Should we have dashboards? One strategy that has worked particularly well for us at Academia is just a Slack bot; we use Slack, so if you're on Teams or something else, translate this to whatever you actually use. It posts a message whenever a setup starts, along with some metadata: where it is, what environment it's in, what type of database it is. Then it does one of three things: if it's destined to fail, it tells us it failed; if it worked, it tells us it worked; or we see nothing at all, no updates for long enough that someone gets suspicious and looks into it, because something must have crashed before it got to the point of reporting its progress. Another thing I really like about this is that you can see your recovery time right there. This one was 10 minutes, because it's a QA instance that basically has no data set at all; that's just the time it takes to do all the other things we need to do, like install packages, Postgres included. To give a production example, this actually happened just yesterday: around 3 p.m. I noticed we needed to replace a roughly 15-terabyte database, because there had been some suspected degradation in the disk performance of one node, so we decided to just bring up a replacement for it.
It completed around 9:17 p.m.; six hours, give or take.

All right, let's dig a little deeper into the step-by-step process by which restores happen in Postgres. With a physical restore, I divide this into roughly three stages; we could break it down more. Stage one: we restore the physical snapshot of the base directory, in our case from S3, onto disk, and that takes some time. In this particular case we were able to get roughly three terabytes per hour. If you're also using S3, streaming to an EC2 instance, and you're not getting that despite the concurrency settings, one recommendation I'd make is to check your disk throughput setting, especially if you're using gp3 volumes or similar. Stage two: after that, it needs to pull WAL, which hopefully is stored in the same place. If any of those files are missing, you might get one of those level-three boss fights where Postgres just never boots, because you don't actually have a backup at all; you have a bunch of different files, not a consistent snapshot of the base directory, regardless of what tooling you used, if any of the WAL files are missing. So it needs to restore enough WAL files to boot in the first place, and then, depending on what you've configured as a recovery target, it will replay WAL from your backup destination to catch up to the point where we can move on to stage three, which is connecting it. I'm drawing the arrow in the direction data flows with streaming replication when we connect this new node to the existing node we want to stream from, whether that's the writer or some cascading setup; the parameter, primary_conninfo, is actually set on the standby side.

In principle we want to monitor each of these steps. How do we do that? Disk usage is useful for stage one. In Datadog, which is what we use (I'll show Datadog graphs; translate them to whatever observability tooling you use), we get a free-disk metric out of the box. Assuming the new node's characteristics are consistent with the existing ones, we can see that as it comes up it starts with more free disk than it eventually ends up with, and if we're catching this mid-setup, we can extrapolate to the point in time when we expect it to finish, and then we're done with stage one. In addition, another metric we get out of the box, without coding anything custom, is a replication delay time metric: we can look at a graph and see these slopes, ideally downward (if they're going upward, that's a problem), indicating when a new node has finally caught up and is streaming continuously at the same point in time as whatever it is following, in most cases our writer. One nice thing about all of this is that restores are what we actually care about, and restores are what we're actually measuring with these approaches.
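If you want a quick spot check of stages two and three from psql rather than from a dashboard, a couple of built-in functions, run on the new node, give a rough picture; this is a minimal sketch, not a replacement for proper graphs.

    # Is the new node still in recovery (replaying or streaming WAL)?
    psql -Atc "SELECT pg_is_in_recovery();"

    # Roughly how far behind the source is it? This gap should trend toward
    # zero as the node catches up.
    psql -Atc "SELECT now() - pg_last_xact_replay_timestamp();"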
But the problem with leaving it at that is that broken restores are fundamentally a lagging indicator of broken backups. So we may ask ourselves whether there are any leading indicators we should monitor, as another layer of Swiss cheese, so to speak, or as an optimization to save us time in the future and make the system more resilient.

This one is a little pgBackRest-specific; I've never used Barman or WAL-G, but I assume they have something similar, or someone has built something similar on top of them. pgBackRest has an info command; you can pipe it to head, and there's a status row that says ok, or not ok, perhaps with some information about an error. This is really great for humans; if you're writing automation around it, you should probably look into the JSON output option that pgBackRest has. But just as a visual, I find this a pretty good illustration of a really easy check you could implement right now. You might not get it out of the box with Datadog's Postgres monitoring, but it's not the hardest check to write: check whether the status is ok, and alert someone if not. Similarly, if you pipe it to tail, you get useful information such as when the most recent backup was taken. I did not take this screenshot yesterday; I took it about a week ago, so at the time this was a bit more recent than it appears today. There's a lot of other useful metadata here as well, like how much space is being used. I'm not going to touch much on costs today for the sake of time, but you should be able to cross-correlate what this tool reports with what you see in your cloud billing later.

Also, on the theme of not wanting to trust only the tool that is supposed to be taking our backups to monitor itself for leading indicators: maybe there's some general notion of corruption we're worried about, or we just want independent checks that our backups are solid. We can check S3 directly. What are we checking for? Well, we can check whether anything is there at all; that's a good test, generally. There's a bit of a directory structure you expect to see, and we can reverse-engineer some of it, but I wouldn't get too complicated here; this is just defending against that level-one boss fight from earlier. "Is anything actually in S3" is a pretty simple check you can make.
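Scripted versions of those leading-indicator checks might look roughly like this; the stanza name, bucket, and paths are hypothetical, and the exact JSON path into pgBackRest's output may need adjusting for your version.

    # 1. Ask the backup tool itself, in machine-readable form.
    pgbackrest --stanza=main info --output=json | jq '.[0].status'

    # 2. Trust, but verify: is anything actually landing in the bucket?
    aws s3 ls s3://example-backups/ --recursive | tail

    # 3. Ask Postgres whether WAL archiving is succeeding or failing.
    psql -c "SELECT archived_count, last_archived_time,
                    failed_count, last_failed_time
             FROM pg_stat_archiver;"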
In addition, on WAL archiving: I mentioned that the write-ahead log is not only what gives physical backups their power, it is what gives them their better characteristics compared to logical backups, in the recovery points and recovery times you can achieve. The downside is that it is a point of failure. If you have a snapshot of PGDATA, the way Postgres backups work at a low level just depends on WAL being present continuously, and if any of it is gone, or it isn't getting sent to S3 or whatever your backup destination is for some reason or another, your backups are broken. Some nice things you get out of the box with Datadog: you get archiving throughput, which you can put on a graph. I have it here split between different clusters, but if you zoom into one, a good sign is that it is non-zero at all points in time, and a bad sign is if it just suddenly disappears. Unless you expect it to disappear: if you don't expect it to disappear, it's a bad sign when it does, and you should look into it. In addition, there is an archiving failure metric that you get out of the box, and here we can see some blips where some clusters apparently have been failing at least once every once in a while. Going back to the theme I mentioned earlier, that great monitoring should be quiet when it doesn't need to interrupt someone, we may ask the question: why is this stat non-zero? Isn't this bad? Isn't this something that should wake the dead in the middle of the night, or wake all the SREs at 2 a.m.? I'm not going to dive into the details here; all I'm going to say is that you need to get a sense, with whatever backup solution you're using, of what is expected, what is anomalous, and what is just something you can ignore. In this case there are some weird edge cases you might run into if you're using asynchronous archiving as an option; often you only need it if you are at a certain throughput or above, and there are some things that can lock up with pgBackRest's implementation in particular, which you might not see with certain other tools. It's honestly probably a subject for a completely different talk, if someone wanted to give it, on the current status of asynchronous archiving capabilities given the way Postgres has its archive_command system set up. To put it short, we have not tuned the sensitivity of this alert to fire on every single instance where archiving fails; we have a threshold tuned to balance the business need for backups we can rely on against the fact that we value our developers and the members of our on-call rotation, and we value not waking them up in the middle of the night and making their lives unhappy.

So, a recap. Every time you need a replica, use your backups, at least as stage one of bringing that replica into being. In addition, periodically test a cold restore in a QA or staging environment, particularly if you have a need for it to exist anyway, if you're doing some sort of test that would benefit from it. On the theme of monitoring, do as much as you can to visualize the restore process, especially where it doesn't have to involve actually alerting or interrupting someone; like that Slack channel that people who are interested can just join and watch, and that works well enough for us. If the rate at which you churn through replicas is three orders of magnitude higher than ours, that probably does not scale for you, and you probably need some other solution, or some alerting; so context is important in your organization. And last but not least, make sure your monitoring, especially your alerting, pulls its own weight. If you do all of this, hopefully you have enough layers of Swiss cheese to keep your backups working well and keep yourself sleeping well at night, not worrying about your data disappearing.

A brief list of acknowledgments: I'd like to acknowledge my employer, for giving me this learning opportunity, especially with that old script, and for helping me prioritize replacing it with more trusted tooling; also my team, who have sat through this presentation several times as I practiced it for SCaLE.
In addition, Michael, the founder of pgMustard: I practiced this talk with him last week, and regardless of what you thought about the delivery here today, I think it's safe to say I did a better job here than back then. And that's a wrap. Any questions?

Biggest horror story? Well, related to this, I'd say that level-three boss fight: I thought the backup was good, all of the tooling and scripting I had written to check things in S3, doing everything I could short of literally testing a restore, said so, and yet it still didn't work. That's pretty scary. Although when I see empty files whose metadata says they're good, that's pretty scary too. I did run into an interesting one, though it diverges a bit from the subject of this talk, in the process of migrating us from our home-rolled backup solution to pgBackRest. We were using an archiver called WAL-E; if you've heard of WAL-G, WAL-E was a predecessor to it that I think has fallen quite a bit out of fashion for a number of reasons, but it worked for us at the time. Our restore testing process is pretty much what I would credit with that migration working, since we had to make adjustments: there are just a lot of pain points in moving from one archiver to another. We had to replace what WAL-E was doing with how pgBackRest does its archiving, and running two archivers at the same time is pretty complicated. There are some bad behaviors; I created a GitHub issue on WAL-E's repository, not because I expect anyone to fix it (I think the project is archived at this point), but for visibility if anyone else runs into it. Effectively it's a race condition if you're archiving quickly enough, because of how each side handles asynchronous archiving, and it leaves gaps in your WAL record on the pgBackRest side, because WAL-E writes directly into something that Postgres assumes nothing else is going to touch. I probably don't have time in the Q&A section of the talk to get into all the details, but I'd certainly be happy to rant all about it offline.

Hand in the back, do you have a question? Any other questions? So, the WAL is constantly being backed up? The question was whether the WAL is being constantly backed up. Yes, ideally. It is a thing you have to monitor. One particular failure mode one should be aware of when monitoring a Postgres backup and recovery system is WAL archiving falling behind: sometimes, if Postgres gets a lot of writes, whatever process is supposed to be sending the write-ahead log, the WAL file segments as the terminology goes, just can't keep up temporarily. That's part of why I mentioned that our internal recovery point objective has a small allowance of a few seconds to a few minutes: sometimes we get a burst of writes, and there are things that have been committed on Postgres that haven't made it into S3 yet.
But yeah, more or less it's continuously being backed up, or at least it should be, and if it's not, then there will be problems. How big are the WAL file segments? The question was how big the WAL file segments are, and honestly, I think it's either a compile-time option or a cluster-level configuration setting; I don't remember off the top of my head. It's on the order of several megabytes. Another thing to keep in mind is that whatever process is doing the archiving tends to compress the segment beforehand, with zstd or something else, so depending on how much logical change is actually in there, on our QA systems it ends up in S3 as tiny files with almost nothing in them, because it compresses so well, and because the segment switches are happening on a timer, a scheduled process, rather than because so much has been written that a new segment is needed. But it shouldn't be too large, to put it short. Any more questions? Thank you for listening.
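For the record, since the speaker was answering from memory: the default WAL segment size is 16 MB, and since PostgreSQL 11 it is chosen at initdb time rather than at compile time. A quick way to check, and the initdb option, look roughly like this (the paths are placeholders):

    # Shows the segment size the running cluster was initialized with.
    psql -c "SHOW wal_segment_size;"

    # Or pick a non-default size (in megabytes) when creating a new cluster.
    initdb --wal-segsize=64 -D /var/lib/pgsql/16/data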
Test, test. Can everyone hear me? Yeah? Cool. Right, so welcome to the second-longest-titled talk at SCaLE; I lost out on first place. Recovering from data loss despite not having a backup: a Postgres story. My name is Jimmy Angelakos. A little bit about me: it's my first time at SCaLE; I identify as a systems and database architect; I'm based in Edinburgh, Scotland, but I'm originally Greek. I have been using open source for over 25 years, and I try to contribute to projects where I can; one of those projects is Postgres, which I have been using exclusively as a database for more than 16 years now. I'm writing a book called PostgreSQL Mistakes and How to Avoid Them, I have co-written the PostgreSQL 16 Administration Cookbook with some other brilliant fellows, and I have also written a Postgres extension for performing time series analysis of Postgres's internal statistics and plotting them into charts, which is called pg_statviz.

So, back to the title of the talk: recovering from data loss despite not having a backup. That's right, the company had no backup. It was a real company. You're not going to see screenshots from the actual procedure, because I don't have permission to share that, and it's been more than a year, so this is based on my recollection of the odyssey. But it was an actual company, with customers, that lost their entire database, and these things happen more often than you would think.

So let's set the scene. The phone rings on a Friday evening at five p.m., when everyone's ready to go home. I was at home because I work from home, so no change there. And maybe it wasn't the phone; maybe it was Zoom, because nobody uses phones anymore, especially for business. The gist of it is that the CTO of the company comes on the call, sounding extremely tired, and says they have lost all their data and they have no backup. At that point I'm ready to commiserate with him, and then he asks: can we recover the data? So I say no, that's it, thank you... no, no, wait. So I said: let's figure this out. What's happening?

First of all, the database they lost was critical to their operation. The company cannot work without this database, because their website is the database: the contents of the database, the service they provide, it's all in the database. Their website, I find out on the call, has been down for more than a week, with a page that says "we're sorry, we're having technical difficulties" and no information at all, for more than a week. Their users are starting to complain, because a week is fine, but more than a week, that's an outrage. Which also means that if the users are starting to grumble, their stakeholders are starting to worry as well: the people who run the company, the people who invest in the company.

So what happened? They had a disk crash that wiped out their production database server. And you may think, OK, it was just one server with no redundancy, so their Postgres is gone. Their most recent backup, which they did have, is months old, so they really can't use it for the kind of service they provide, because it relies on the recency of the data; it needs to be up to date. They get a quote from a database recovery company, which says: we'll see what we can do in two weeks, no guarantees; it's going to take two weeks, and we don't know if we'll be able to recover the data.

So how does one find themselves in this situation? The company used to be a startup, and they grew in size, customers, and importance over time, and they transitioned to Postgres from a less serious database decades ago. So they know Postgres; they've been using it for a long time. But they were using pg_dump for their backups. pg_dump is not a backup tool; it can complement a backup tool, or be used as a backup in an emergency, but it's not a backup tool, because it doesn't do anything on its own; you need to script something that uses pg_dump to take a backup. And unfortunately, the lovely script they had used over the past 20 years or so had started silently failing. There are all these things that can go wrong without you noticing when you move things around: you may think, I'll just take this server and put it on the cloud, or in a Kubernetes pod or something, and one of the scripts doesn't like the new environment, the very reliable script you've used for years and years, and suddenly there's no error message, nothing, and it has been silently failing. That's why their last "usable" backup was months old.

And I'm thinking: is this really bad luck? Were they unlucky? We know that hardware fails; it is not uncommon. But there's a glimmer of hope: not the database recovery company, but a data recovery company has looked at the hard drive and managed to recover some files. It looks like the physical disk is not damaged; it was probably a controller failure, and that's why they recovered the files. So the company gives me the dump of the recovered files, in order to see if we can construct a database out of them. But they give me files randomly distributed across recovery directories named 0001, 0002, up to 0100-something, and when I look inside these folders I don't see any structure. It's just files, jumbled, and the folder names have been lost; worse, the files look like they belong to different directories on disk, so they shouldn't even be together, and there's no guarantee that the files in 0001 belong in the same directory. Why is this bad? You have to consider how Postgres stores things on disk.
Let's look at that for a bit. The physical structure of a Postgres database on disk: if you assume you're using Red Hat or a Red Hat-like system, it will probably be under /var/lib/pgsql/, then the version of Postgres, especially if you're using the community packages. This is where you're going to find your Postgres database, and when you perform an ls on this directory you see things like global, conf.d, some configuration files, and you also see a directory called base. This is the one that holds your data, mostly: your tables and your indexes are going to be in base. Why is it called base? Haha, it's data slash base, so that's where the database is. The base directory contains subdirectories that hold the individual databases, and the name of each subdirectory is the OID, the object identifier, that identifies that database in the Postgres catalog. So they're going to be incomprehensible names: if you have a database called jimmy, you won't find anything on disk that's called jimmy. You will see things like 1, 4, 5, 16582, 16587, and pgsql_tmp for temporary files.

We mentioned the Postgres catalog: there's a view called pg_database that you can select from, and 16587 is an identifier we saw on disk for one of the databases. So, SELECT * FROM pg_database WHERE oid is this, and you get the OID and, finally, the database name, and you find out that this database is called pgbench, which is the example I'm going to use in these slides, plus other things such as the owner of the database, the encoding, and so on. As we mentioned, each of these subdirectories of base contains the tables and indexes for that individual database. How do you find which file belongs to which table? By selecting the relation name, the object identifier of the table, and the relfilenode (the relation file node is the identifier of the file that holds the table) from pg_class, where the table name is pgbench_accounts, you can find out how it's stored on disk and which files it's using. And you can see that the table pgbench_accounts, which has OID 16594, is stored in a relfilenode with a different number. What does that tell you? It tells you that, unlike databases, a table's OID may not match its relfilenode: you can have a table whose OID in the Postgres catalog is 16000-something and a different relfilenode for that table. You will also find some other files on disk, such as the fsm and vm files, which are the free space map and the visibility map for the table, and other useful things; and it works the same way for indexes, which have their own individual files on disk.

Now, Postgres has something called the segment size, and the segment is how big a file can be on disk before a second file is needed. Tables that are more than one gigabyte, because the default segment size in Postgres is one gigabyte, get split into multiple files: what you see on disk is one file that's one gigabyte, then the next file it spills over into is called .1, and if there's another gigabyte then there's a third file called .2, and so on. So you can see how your table can keep growing without any restrictions apart from disk space. OK, so that's how Postgres puts files on disk.
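The two catalog lookups just described, roughly as you would run them; the object names come from the pgbench example used in the slides, and your OIDs will of course differ.

    # Which directory under base/ belongs to which database?
    psql -c "SELECT oid, datname FROM pg_database;"

    # Which file(s) hold a given table? Note that the OID and the relfilenode
    # can differ, and tables over 1 GB spill into .1, .2, ... segment files.
    psql -d pgbench -c "SELECT relname, oid, relfilenode
                        FROM pg_class WHERE relname = 'pgbench_accounts';"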
So what is their recovery plan, if any? The recovery plan is to recreate the data directory structure we just saw for their lost database, using the files we got from the recovery operation: we basically copy the files into the directories where we believe they should be. The next step, after we've copied the files, is to try to start Postgres with the directory we put the structure in, which we will call /opt/recovery. If Postgres starts, we will attempt a pg_dump, a logical dump of the database, so that we can ensure everything can be read correctly from disk, because if there's any corruption, pg_dump will stop and say "I can't read this file", "I can't read this index", and so on. After we have the dump of the data, the plan is to restore the dump into a fresh database; we really shouldn't trust this directory anymore, we should extract the data from it and try to start anew from a fresh server. Next step: question marks. Profit.

So how do we copy the files into the data directory? It looks scary: it's all numbers, things like 112, 113, 1200, and if there are hundreds of tables it looks daunting. But the reality of it is that OIDs below 16384 are reserved for system use, so they're certainly not the tables of their database, and I can focus on the ones above that first. (Yes, fortunately the file names were saved; there was just no structure to the file system, it was a dump of files.) So I can skip most of those low-numbered files, because they are common across databases and are part of Postgres: default tables and views.

After a lot of typing and copying files back and forth, we need to see if Postgres will start. The first attempt: we use pg_ctl to start Postgres and pass the -D parameter for the data directory, saying try to start the Postgres server from this data directory, which is /opt/recovery. It says: waiting for server to start... stopped waiting; could not start server; examine the log output. That was not unexpected, right? Let's see what the log says when we examine it. The log says FATAL: could not access status of transaction 803; could not open file "pg_xact/0000": No such file or directory. Well, that's a problem. It means that either we didn't find that file in the dump or it wasn't recovered. I start looking for these files, and I can't find them in the dump, and it's obvious that Postgres won't start if that file is not there: it says aborting startup because of this failure, and the database is shut down.

So what is pg_xact, and what is this 0000 file? It holds the transaction commit state for each transaction, on disk. These are files that are usually 256 KB in size, and they store data on the transaction outcome, whether it was committed or aborted and so on, in a compact format that stores the outcome of four transactions per byte, so it's two bits per transaction state. Since we don't have the file, I've decided to fake it, and what we want is status 01, committed, for all transactions referred to in that file. So we need to fill it with 01 01 01 01 in each byte, for those four transactions, and the way to do it easily is with dd: you create a file with 256 KB of zeros, but you actually translate the zeros to octal 125, which gives us 01010101, and you fill the whole 256 KB with that. Straightforward.
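The incantation just described, roughly; the output path assumes the /opt/recovery layout used here, and you would repeat it for each missing pg_xact segment.

    # 256 KB of zero bytes, with every byte rewritten as octal 125 (binary
    # 01010101), i.e. "committed" for all four transactions packed into each byte.
    dd if=/dev/zero bs=256k count=1 | tr '\0' '\125' > /opt/recovery/pg_xact/0000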
So we put the file there, and does Postgres start now? Well, originally it didn't: then it complained that 0001 was missing, and then 0002, so I had to fake all of those files too, and finally, at some point, Postgres started: waiting for server to start... done; server started. I'm stunned that this collection of files on disk appears to think it's a working database.

Can we connect to the server? I try psql. By default, psql tries to connect to the database postgres as the current user, and I was logged in as postgres. Connection failed: FATAL: database "postgres" does not exist. That is not good, but it tells me why it thinks the database doesn't exist: the database subdirectory base/5 is missing. OK, so I need to recreate this directory 5, which is the OID of the postgres database, and fill it with files. I find the files for directory 5, clickety-clack, and does Postgres start now? Yes, Postgres starts, and I am starting to become hopeful at this point. Now let's say we want to connect to the database pgbench to see what's in there. \connect pgbench: FATAL: database "pgbench" does not exist; the database subdirectory is missing. But now I know the OID, so I know which directory to recreate and fill with the table files and index files and so on that I found. Some more typing, and I can connect to pgbench after creating the directory and copying the files in.

What follows next is the pg_dump. So let's do it; people are waiting. This is taking place over eight-hour days on Saturday and Sunday (I said I can't work more than eight hours a day, but I will help you), and they're waiting; they haven't slept in days; they're waiting for news. So I try a pg_dump of database pgbench into the file pgbench.dump, and it works. At this point I am ecstatic. I call them immediately and say: I have a dump of your data; I don't know how corrupt it is, but I have a dump of your data. Are we done? Is the recovery complete? No. We need to take this dump and restore it into an empty database server to see what we get.

Restoring the dump is pretty simple: you create an empty database called pgbench and you pass psql the file pgbench.dump. It's an SQL text-format file, so it just runs SQL commands to recreate the schema and populate the database. And I get: ERROR: could not create unique index on a table, and it says this key is duplicated. Now that is weird, but not really, if you consider what we did. What we did was declare that every transaction was committed when we faked those missing files that we needed in order to start the server, so some rows that were dead have now been resurrected, and we need to figure out which is the current version of each row so we can get rid of the duplicate, which may have the same or different data. That repeated itself about a hundred times, and I had to go to the customer and ask: I have this customer account with this data, and this customer account with that data; and they'd say, yeah, that one's more recent. So I did that about a hundred times and they confirmed which row we could keep. Fortunately, it seemed to be only a few tables that had this issue, and it was only about a hundred cases, so it didn't take more than a few hours to work through.

So, in the end, this was a dramatic save for them. We completed the restore and they had a working database, with data that appeared to be correct and intact, verified by them, by Sunday afternoon, which was less than 48 hours after they got in touch.
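The dump-and-reload step, in rough commands; the database names are from the pgbench example used in the slides, and the second cluster is assumed to be a fresh, empty server.

    # Dump the resurrected cluster logically; any page Postgres cannot read
    # will make pg_dump stop with an error, which is exactly the test we want.
    pg_dump pgbench > pgbench.dump

    # Reload into a brand-new, trustworthy cluster.
    createdb pgbench
    psql -d pgbench -f pgbench.dump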
Meanwhile, their stakeholders were holding a conference call, deciding whether to pull the plug on the company. If they had lost this data, it would have been an extinction event for the company; they would have had to work very, very hard to regain the trust of all the customers whose data they lost. So the stakeholders are on a conference call, the CTO jumps in and says "we've recovered the data", and applause breaks out on the call. Everyone is relieved, because they knew that someone had been working on this for the past couple of days, but before then they hadn't seen any evidence that the data was actually recoverable. But I'm not relieved, because I can't relax until we create a backup of this database that they can actually use; for all I know, their disk might fail in the next five minutes and we'd have to go through all of this again. So I say, OK, tell your stakeholders, but I'm taking a backup this moment. I create a streaming replica of the server immediately, I set it up with repmgr so that they can fail over, and I set up backups immediately as well; Barman is the tool that I used. After I had these things in place, I said, I can now relax, and it's time for a beer.

So what is the final analysis of this adventure? They were extremely lucky. Extremely lucky because most of their files were recoverable, but they were also lucky that Postgres has such a clean structure, the one I was able to describe to you. Even if you weren't familiar with Postgres internals, someone can spend 30 minutes explaining to you how files are put on disk, and that gives you an understanding of this cryptic directory full of numbers. It's not black magic; there's a logic behind all of it, and this is why we were able to recover their data: it wasn't just one binary blob on disk, it was a collection of files, and, as was pointed out, fortunately they had the file names, so we could attempt to reconstruct it. It took a few iterations, but eventually we got there. So even if you've lost some of the files, you can use this methodology to recover as much of your database as is recoverable, and save whatever can be saved. You can also fake files or reconstruct them correctly, files such as pg_filenode.map or pg_control; you can recreate those, or fake them, or take them from a running database and they will be the same. The good thing is that Postgres complains at startup that a specific file is missing; it doesn't just say "error, I'm throwing my hands up, I don't know what's wrong". It tells you what's wrong, and this is why we were able to identify the files that were causing problems, because they were missing, and so on. If the file that holds the data for a table is gone, then the table is gone, but you can attempt to recover the other tables: you can just remove that table from the catalog and proceed, or you can dump each table individually; you don't have to perform a full pg_dump of the database.

Now, what about the write-ahead log, the one in pg_wal? That wasn't so important for this recovery, because whatever is in there is going to be used to recover to the last feasible point when the server starts up. Whatever you have is whatever you have: if you've lost some files from the WAL, they just won't get replayed, and you may have lost the last, say, 16 megabytes of transactions. I think that's acceptable compared to losing your entire database.

So let's see what you should not do, so that you can avoid finding yourself in this situation. First of all, you should not use pg_dump as a backup. In the other talk, by Magnus, he said the same thing: pg_dump is not a backup.
And I told him that we should probably change our documentation, because, as someone pointed out to me a few weeks ago when I told them pg_dump is not a backup: your documentation says it is a utility you can use for backing up Postgres. So we probably need to send a documentation patch, which will get scrutinized and rejected, but we need to tell people that pg_dump is not suitable as a backup tool. It doesn't support point-in-time recovery, the thing the WAL is there for; a pg_dump only captures that one moment in time, you cannot roll forward, you cannot recover up to a certain point; it's essentially a snapshot of the database. Plus, as a backup tool it is sorely lacking because it has no automation. For me, if your backup is not automated, it's not a backup: manual backups are not backups, because you may need to go to the dentist, or you may have something else to do and forget about it, or someone else forgets to take the backup. It needs to be automated, and it needs to be monitored, so that you know if the backup has failed, and there needs to be alerting to notify someone: you have no backups, you need to run and take a backup now, because we'll lose data if something happens. And of course backups need to be tested, because, as GitLab found out a few years ago, they had a backup, they hadn't tested it, and so they lost a lot of data. In order to avoid doing all of this yourself, you can just use one of the ready-made backup tools that exist for Postgres, like pgBackRest or Barman, the two open source tools that dominate the field; there are also proprietary solutions you can use, and I know some of them work with Postgres now. So really, this should come as no surprise: you should not use pg_dump as a backup.

What you should also not do is maintain radio silence, because the people who need to use the database will start coming up with theories. This is real: they immediately came up with a theory that the database was unrecoverable, that it was ransomware, that some hacker broke into the database and that's why the company wasn't saying anything. The company wasn't saying anything because they didn't know. But when they found out that the files could be recovered, they shared that with their customers; when I told them I now had a dump we could attempt to restore, they shared that with their customers; and when I told them the database was up and running, they immediately shared that too. You need to keep everyone informed, otherwise you can suffer reputational damage, which is sometimes worse than the actual damage of the data loss. It wasn't a security breach; it was just one of those things that happens: a disk went boom.

And what you should do in that situation is not panic. You don't start running around waving your hands in the air saying "oh my god, what are we going to do, inform the president, inform everyone". You need to think rationally, examine the situation, see what you do have and what you don't have, and see whether that can lead to a successful recovery. What you should do in order not to find yourself in that position in the first place is have redundancy: streaming replicas are dead easy to set up (a minimal sketch follows), and even if your server goes boom, if you have a streaming replica you can just start using it immediately. You should have automated, tested backups, as we said. When you do recover the data, don't touch it: make a copy of the recovered files and operate on the copy. And, as we said, you need to keep the team informed.
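On the point that streaming replicas are dead easy to set up, a minimal sketch with nothing but core tools; the host name, user, and path are made up, and tools like repmgr, which was used here, wrap this sort of thing for you. Note that this gives you redundancy; it is not a substitute for tested backups.

    # Clone the primary and write standby.signal plus primary_conninfo (-R),
    # streaming the WAL generated during the copy (-X stream).
    pg_basebackup -h primary.example.internal -U replicator \
      -D /var/lib/pgsql/16/data -R -X stream -P

    pg_ctl -D /var/lib/pgsql/16/data start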
So even if you're the database person attempting to stay in front of a screen for hours and hours, trying to recover the data, you need to keep the stakeholders informed about what you're doing. Another thing we can learn from this is that the degree of recovery, how much data you recover, matters, but the speed matters as well. That's why we took the decision, me and the company's stakeholders, to fake the transaction statuses and proceed: it would have taken a very long time to figure out the correct outcome of every transaction, so in the end it was faster to just fake the files, pretend all the transactions were successful, and deal with a few duplicates. It's sometimes better to bring something back faster rather than wait for complete correctness. But of course that is not a decision that I should make: as a tech person you should always ask the customer what they want to do, because what may be obvious to them may not be obvious to us. They have business requirements; they may say this table matters, recover this table only, or recover this table first. Thank you very much. Any questions? There's one, yeah.

Can you hear me? Just out of curiosity, and it could have been in the clickety-clackety, but did you not use the months-old backup to simulate the file structure, to try to identify the tables? There was no file structure in it; it was a logical dump, so it was SQL commands, so I couldn't use that. The pg_dump they took, well, every pg_dump is a logical dump; it doesn't store any of the on-disk structure of the database, it just stores the data and the schema as SQL, and sometimes it puts it in a binary format, but they only had the text format.

I was wondering if you had any thoughts or ideas on whether there's any room for Postgres core to build out something to protect users of Postgres from themselves in this case, some strategy that might at least help avoid this failure mode of "we don't know what we don't know", like we don't know that our backups aren't really working. A really silly thing that probably isn't very useful, which I just thought up, is: what if Postgres just refused to start, or raised an error, if you don't have an archive_command set, or some other things pertaining to backups, at least as far as Postgres has visibility into them? And you could set some other GUC saying, right, I don't care about my data, this is just a test DB, with underscores or camel case or whatever, and then it would start up. But maybe not that specifically; do you have any thoughts?

So, Postgres leans on the side of letting the user decide what to do instead of telling them "you should do this", which gives you more freedom and is less annoying, because, as you said, you may have a test database that you don't want to back up, and you don't want Postgres complaining "oh, you don't have backups for this database". What Postgres tends to do is let you shoot yourself in the foot. Also, there is intentionally no official backup tool for Postgres in core, so that everyone has the freedom to develop what they like: pgBackRest has focused on different areas of the backup procedure than, say, Barman, or WAL-E, or WAL-G. There are all these solutions that you're free to use, and Postgres doesn't make you use any of them.
And archive_command is not used by all of these tools anyway, so you wouldn't necessarily get a warning. Postgres does warn you when there's something preventing it from starting up, but it doesn't warn you about things that you should simply know, such as "take a backup".

Are they at least using a RAID system on their new server? I don't know, I don't remember; we restored onto some cloud instance, so I guess that's taken care of. But it wasn't really my job to tell them where to run things. I told them what to do so as not to get into this situation again in the future; I can't tell them where to run their databases, or which provider or hardware to use, but I could tell them that they really should have a backup, they should test it, they should have redundancy, and so on. And that's why I ensured, before I left for the day, that I had created those things, because if I had just said "you need to do these things" without checking whether they did them, they probably wouldn't have, given their track record.

What percentage of their weekly expected revenue did you charge for your 48 hours of work? Let's say I got something out of this. It wasn't my engagement; I was working for a company that said, it's up to you if you want to work during the weekend, or we can tell them we'll start work on this on Monday. I took the decision because I felt sorry for them, really. When they told me that their database had been down for a week, I saw that this company was in deep trouble, and I tried to help them. At that point it wasn't really about money for me, because I would make the same amount of money whether I helped them on a Saturday or a Monday or whenever. OK, any more questions? Great, thanks very much. Thank you, Jimmy.