And then later, of course, it makes sense that everyone sees the correct state of the variable. Again, if it's not clear, let's just run through it one more time. Now B wants to delete the variable. It's gone for B, but A still sees the value 4. After a commit, it's gone for everybody, and that's it: we have another coherent setup. For this type of MVCC setup, Postgres itself uses something called a transaction ID. You can think of it as a running counter throughout the database, one counter which keeps incrementing, tick, tick, tick, tick. And that counter is how the state of a row is determined across your entire setup. Running through the same example again, there are two values, xmin and xmax. If you're not able to read it, that's what it says over there: xmin, xmax, and the current transaction ID. What happens is that at each point in the run, the row itself tells you the earliest transaction which can see the record and the last transaction which can see the record — you can actually select these columns yourself, as the small query a little further down shows. Using these transaction IDs is how MVCC is handled. It's a very simplified description, but it will help us understand the later parts, because the transaction ID turns out to be quite crucial to the whole story here. So values like 2 and 5, those ranges, are what determine visibility. Which also means that if you have a very long-running transaction, it's always going to see the old value and prevent the table from ever clearing it up. So we have transaction IDs, and this is how Postgres does it, but there are of course certain catches. It turns out that, for a lot of different reasons, the transaction ID is 32 bits long. If you can do the math mentally, 2 to the power of 32 is about 4 billion. What that means is that at any point in time a table effectively has 2 billion transactions in its past and 2 billion in its future; it cannot remember, or handle, more history than that. This causes a limitation where every table must be vacuumed, or touched, at least once every 2 billion transactions, and preferably much earlier, and this is usually handled by a daemon, the vacuum process; there's an autovacuum running, and we'll see how that comes into play as well. Now that we have that primer behind us, the idea of this talk is also an incident report of sorts, where we realized that we were going to run into problems eventually, so let's just see how that goes. One day I was just working — this was earlier last year, and I wasn't doing too much database administration — and I get this email from our AWS support team saying the instance is showing signs that autovacuum is not working, and then there's the last line: the current age is 1 billion; if you don't fix this before 2 billion, the instance will stop accepting new transactions. Okay, that sounds scary, but we were not sure how to handle all of this, so we decided, okay, let's see how D-Day would look. When it hits 2 billion, what is the problem that's going to happen? All right, stop. The whole database just locks up: you lose access to any write activity, and the server will shut down.
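Coming back to xmin and xmax from the primer for a moment: Postgres exposes them as hidden system columns on every row, so a query along these lines lets you see which transaction created each row version and which one, if any, deleted it. The table name orders is just a placeholder for illustration.

```sql
-- xmin: transaction that inserted this row version
-- xmax: transaction that deleted or updated it (0 if the row version is still live)
-- 'orders' is a placeholder table name, not one from the talk
SELECT xmin, xmax, *
FROM orders
LIMIT 5;
```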
This is, of course, our master instance, which is supporting all those e-commerce stores who want to be able to do real-time queries, run aggregations, and all that reporting. For them, the bottom-line business requirement is that the server should not shut down. There were some saving graces: we had a replication setup and certain redundancies built in, so the read replicas we had might still work, but once writes stopped, stale data is what would be served from those instances. So we might still have read access, but all the new transactions are just not going to go into the system. So how would D-Day look, and what can we do? Why don't we just promote a standby? Like I told you, we had read replicas; why not kill off the master, which has hit 2 billion, and promote the standby? Well, it turns out the binary replication setup in Postgres replicates the transaction IDs as well, so even the replicas will have the 2 billion IDs in their setup. So again, that's not going to work; promoting a standby is not as simple as you might think. What actually has to happen is that the server restarts in something called single-user mode. That's something like only your administrator being logged in as root, with all the other users logged out of the system. In this single-user mode you have to go in and do a manual vacuum of the entire database, and that's 750 gigabytes. Going through it even with a very high-end instance is going to take you days, not hours. Do you really want a downtime of days, working through this while only one user can connect to the database? Not looking pretty. So this is the point where you panic, become a headless chicken, and run around asking, oh, what do we do? But the idea is to keep calm; we still have our heads on. So let's sort it out and see how everything might play out. We decided to go back to the source. Postgres itself has really extensive documentation, and I think they're quite proud of it, too. When I looked into it, I saw that there's a section on transaction ID wraparound. At this point I'm hitting myself in the face and thinking, why didn't I think of reading the documentation before? But it's good that it's already there: transaction ID wraparound is a well-documented aspect, and you really have to make sure you plan for it and avoid it. So the documentation got us this far; let's go in and see what tools it offers, what additional features can help us. We dug deep and found a lot of additional tools. There are system information functions for diagnosing the system; there's a system table, pg_database, which we'll look at later; there's another system table, pg_class, which again helps you diagnose further information about the tables. It's very comprehensive, so if you dig deep into it you're quite likely to find the relevant portions. It's a bit of a hassle, but keep calm, don't panic, and just go into it. The first step was to estimate the damage: D-Day was going to happen, but when would it happen, when would we hit the magic number of 2 billion? For that there's a function called txid_current().
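A call roughly like the one below is what we put on a schedule; on newer versions (Postgres 13 and later) the equivalent function is pg_current_xact_id().

```sql
-- Current transaction ID counter; the value is 64-bit with an epoch,
-- and its low 32 bits correspond to the on-disk XID that can wrap.
SELECT txid_current();
```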
It just gives you the current number. You look at the clock and look at the current number: it was 1.2 billion. This was close to what the AWS support people were telling us; they said you're at about 1.2 billion, watch out, you're going to hit it. So we ran this operation again and again, at an hourly interval, during different times of the day — peak load, off-peak, weekends and weekdays — and the number kept going up. At hourly intervals, we saw we were burning through roughly 600,000 transactions per hour. So 600k transactions per hour, with about 1 billion transactions remaining: by our estimate that came to about 40 days. The prognosis was: you have 40 days, try to fix all of this and see if you can prevent your database from shutting down. So we decided, okay, let's look at the quick wins. This was a summary of what our different applications were doing: insert, update, insert, insert, update — six statements, and for those familiar with SQL, if you don't do anything specific around them, those consume six individual transaction IDs. Those are what add up to the burn rate of 600k. So the first thing we decided was: wherever the business logic allows for it, start grouping them together. By wrapping them in a BEGIN and a COMMIT, you're batching the statements into a single transaction. This is a good practice anyway, but it depends on your own logic: if you have something which naturally happens as a batch, you might as well do this. This bought us another 20 days or so — roughly two months from that point. Okay, 60 days; let's start trying to diagnose and prevent this from happening. We go back to the documentation and look at pg_database. That's a system-level table present in Postgres which gives you statistics about the entire cluster of databases that you have. Over here I've given just three example rows, the template databases and postgres. The interesting field is the one in the middle, datfrozenxid: the age of that value is the number we care about, that's your 1.2 billion. So pg_database helps us identify the specific database which is causing the issue. Then we look at pg_class, and at this level we are able to see the relations which are causing the issue. Here there are two relations, products and customers — again, these are just names of relations in our database, and they might be some among many; at that point we had a few hundred thousand tables, or almost that range. So we had to dig in deep and find these values. Going back to traditional SQL, you go in and type it all out; this is the one query we had to run eventually, and the idea was that it gives us a listing ordered by age. The point is to not be afraid of digging in deep, even though we have limited time; just use the documentation and try to build it all up. At this point we have this row over here — if you can't read it, it just says 1.2 billion. So this relation over here is what is causing the database to freak out and think that the counter is going to run over.
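The queries behind those two listings looked roughly like the ones below; the exact query we ran had a few more columns, but age(datfrozenxid) and age(relfrozenxid) are the numbers that matter.

```sql
-- Per-database wraparound age (this is where the 1.2 billion showed up)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- Per-relation age: plain tables ('r'), materialized views ('m'), TOAST tables ('t')
SELECT c.oid::regclass AS relation, c.relkind, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 20;
```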
The wraparound is going to happen and the instance is going to shut down. But now we have a name. Now we have a name for the problem, and we need to see how we can understand it, diagnose it, and see if we can really prevent everything from happening. So, the culprit. The name looks very archaic — it's obviously not something that we wrote, and it sounds serious because it looks like something Postgres is handling behind the scenes. The name is TOAST, which stands for The Oversized-Attribute Storage Technique. It's used to handle large values: when you have a table with sizable data stored in its columns, the TOAST table supports the underlying user-created relation by holding those large values out of line. So it's a system table used for handling large values. We wanted to go in and see what this was actually supporting; there had to be some data behind it, right? The system wouldn't have created it out of nowhere. It turned out to be supporting what's called a materialized view, and the name of it was Sales Daily. As the name suggests, it holds the daily sales averages for the different stores. So Sales Daily was the reason all of this was happening. Now we have the name of the culprit; let's see how far it affects us. A materialized view is similar to a table: it's something you do in Postgres where you define a view and then also load the data into it. It's primarily used for caching; you can think of it as similar to a table, but it's more like a view which queries some underlying table and brings the results in. We were using it for aggregation, to help our different customers run relatively quick queries: we didn't want to go into the whole backlog and process their daily data just for one number, so this cache was helping us give them faster responses. It's not essential data; it's an abstraction on top of our data. So now the hunt is on: we've got the culprit, but we need to fix the problem. We have an idea of what's happening, so what do we do? Let's go step by step and diagnose, or rather fix, the issue. The first step is, like any sensible person, take your vitamins: aggressive maintenance. Like I told you, vacuum is the crucial operation that helps with all of this, so the idea was to vacuum more aggressively. We started with the basic parameters; there are a number of them if you're familiar with Postgres. One of the most common is autovacuum_max_workers, which sets the number of simultaneous background processes that clean up the dead rows and old transactions. There's maintenance_work_mem, which allocates how much working memory maintenance operations get. And there's autovacuum_naptime, the interval the autovacuum launcher sleeps between runs so that it doesn't affect your bottom line. These are the minimum configuration settings you can touch without doing anything drastic to the setup or affecting availability; we just wanted to see if this would fix it. Well, the answer was no. Or rather, it helped, but it was not strong enough.
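For reference, the kind of change we made looked roughly like this; the values are illustrative only, and on RDS these go through the parameter group rather than ALTER SYSTEM.

```sql
-- Illustrative values, not recommendations; tune for your own workload.
ALTER SYSTEM SET autovacuum_max_workers = 6;    -- more simultaneous workers (needs a restart)
ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- more memory per vacuum pass
ALTER SYSTEM SET autovacuum_naptime = '15s';    -- wake the autovacuum launcher more often
SELECT pg_reload_conf();                        -- naptime and work_mem are picked up on reload
```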
So this helped us roll back some of the damage that was being wreaked on the system, but it was just not strong enough to undo all the damage that had accumulated. It was literally bringing a knife to a gun fight — the Blitzkrieg is going on and we're trying to hide behind umbrellas. So we're trying to see if we can fix all of this, and step one didn't work. Okay, let's do precision targeting, because we already went through all the trouble of finding which TOAST table was involved, which materialized view it was supporting, what its age was, and what the underlying data was. So we decided to just focus on that relation. The sales daily materialized view is the relation, so let's run a user-controlled vacuum on just that relation. If you remember how MVCC was set up, the vacuum operation is what clears up the old row versions and transaction IDs. It also does a bit of reorganizing of the data, but our primary goal here was to make sure the IDs were updated, to keep the table's transaction ID current. Well, again, it didn't work — or rather, it sort of worked. The table's transaction ID was updated, so far so good. But pg_database, which you'll remember is the system-level overview, did not get updated. That was very strange to us, because it's supposed to just be a summary of all the underlying relations. It turns out this was something we did not expect, because all this time we were running on RDS. RDS gives you a fair amount of access to the database, but not to the underlying host, and not real superuser access. The role that we had, rds_superuser — maybe they should have just called it not-a-superuser — gives you admin privileges over the database, but the scopes and roles did not allow us to touch all the system tables. We had also run into this with extensions: RDS has fair support for most Postgres extensions, but some extensions you just cannot install on RDS due to similar limitations. So the rds_superuser role prevented us from really updating pg_database. Okay, we look back at the clock; of our 60 days we've lost a while, but never mind, let's continue, we still have some time left. So we decide to yank it out. The materialized view was essentially a cache: it was helping the application developers return the numbers more quickly, a cache which rolls up the numbers. So we talk to them, we convince them: can you accept a trade-off to prevent this whole disaster? We decide to just drop the materialized view. This sounds like a very last-minute, very drastic thing, but since it's not a crucial piece, you can always go back and run your sum on the original table; it might take a few seconds longer. And since this is the business reporting side, maybe that would be acceptable to them — it's not in the online path where the orders and the stores are processing their transactions. So we drop the materialized view. Well, it still didn't work — sort of. The rds_superuser role was still missing the needed privileges.
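To recap step two in concrete terms, the targeted attempt looked something like this; matview_sales_daily is a placeholder name standing in for our actual materialized view.

```sql
-- Freeze just the problem relation (its TOAST table is processed along with it);
-- VERBOSE reports how many pages and tuples were touched.
VACUUM (FREEZE, VERBOSE) matview_sales_daily;

-- The relation-level age did come down after this...
SELECT age(relfrozenxid) AS rel_age
FROM pg_class
WHERE oid = 'matview_sales_daily'::regclass;

-- ...but the database-level number, in our case, did not budge.
SELECT datname, age(datfrozenxid) AS db_age
FROM pg_database
WHERE datname = current_database();
```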
The privilege limitation also tied into something else: the autovacuum process that was running was still working through our entire database at a fixed, serial speed. So even though it had sufficient privileges to process the relations, it still couldn't catch up and bring all the old IDs forward. Again, we look at the clock and see this is going to be a long way around. At this point we're thinking we're in it for the long haul, but we have enough time to work through it. So let's run something called a vacuum freeze. The asterisk on the slide just means everything: the command is just VACUUM FREEZE and it runs on your entire database. In the version we were running, it is a serial operation, so it took a long time to run. It goes through your tables one by one, all the relations, and updates the transaction IDs so they're all current. The freeze variant was useful because it doesn't lock the table the way a vacuum full does; freeze suited us because we only wanted the transaction IDs to be updated to prevent the wraparound. So we start this, and an excruciatingly long time later, it finally seemed to have worked. That was the saving grace: after spending a lot of time running this manual vacuum freeze process, it finally did fix the database. Our guess, looking back, is that because it touched all the relations in the database — in serial order, but all of them — the system-level information got updated correctly at the next scheduled point, and once all the values are managed correctly, that fixes the entire ID wraparound problem. So yeah, that was tiring; it was a long few weeks. And then we decided it's not enough to just solve the problem — we really have to look at what we learned from all of this. I think Sentry, the monitoring company, had a similar problem, and they published a long write-up of what happened, so we decided we should also really learn from this and see how far we can take it forward. There are a few categories of learnings. The first is the kind where, when you tell someone, they just go: that was obvious, why didn't you think of this before? For us those were the things we should have seen from a mile away. A large number of tables was something we were battling: the very first support email essentially said the system is not able to keep up with your thousands, or at that point many tens of thousands, of tables. By then it was too late, but maybe we should have considered sharding earlier. We have different e-commerce stores, and we already had essentially full sharding, where each store gets its own identical set of tables; at that point we could have moved stores to different hosts or different database setups and avoided having all of this on the same cluster. So with that sort of sharding we could have avoided it — but I guess that's hindsight.
So the idea is that these kinds of observations are obvious in retrospect, and it's better to learn them some way other than in production. The other set of learnings is more interesting: things you have some idea might happen, but no one is going to call you naive for not thinking of them. These are the non-obvious learnings, avoidable if you just think about your system and reason about it, and this is something I think the application developers, or the sysadmins, or someone, really needs to take ownership of. Usually the administrators or system-level architects think about the hardware hosts, the RAM, the RAID and disk setup, the IO latencies, while the application developers are often in ORM land: they look at everything in terms of application code, they use an ORM, and everything is objects to them. But understanding the schema better is worth spending time on: being able to reason about the different relations you've set up helps you tune performance, and it helps in a lot of different aspects, not just this specific instance of database administration. The other thing is where a good DBA, or someone who has been working on this for a long time, helps with tweaking the configuration of your setup. A lot of the time, most applications and most processing systems work fine out of the box. Postgres has sane defaults, and I'm sure MySQL has sane defaults too, and then you have configuration tweaks for your specific use case. If you have a write-heavy application, or something which would be better off with different settings for your checkpointing or your checkpoint targets, in that kind of scenario it's much more useful to look at the minimum changes you can make to your configuration to guarantee sufficient performance. Then of course there are the unexpected learnings: issues which are pretty much outside your visible spectrum, where you think you're safe and have prevented all the problems, but then — there are known knowns, as someone, I think Donald Rumsfeld, put it, and there are unknown unknowns, and that's the far end of the spectrum. These are issues you never really have in view: you think they're not going to happen, or you don't even know they could be a thing.
For these unknown unknowns, the question was whether you can really prepare for them. Maybe the cause of the issue is something you cannot identify in advance, but if it translates to the same kind of effect — say you have an application where the whole connectivity to the host is lost — then even if you can't diagnose the underlying cause ahead of time, you can still plan for the scenario, and in that sense the unknown unknowns can be prepared for. For us this blind spot was in RDS: we did not know that the superuser was slightly less privileged, so when we tried to run the manual vacuum on the relation, we found it did not update the system-level statistics for the table. With sufficient expertise you might know about the scopes and the roles; I guess it's worth digging into this and making sure you understand the constraints, whether you're running on bare metal, on a hosted provider, or going even further out and treating the entire database as a service. Now let's look at the recommendations, in the specific context of what caused our issue and what we recommended to ourselves; maybe you can take a few away. The first is specific to configuration parameters. For autovacuum there are really a lot of knobs you can tune. It might be good to leave them at default values when you're starting off, but it's good to know that they exist and can help you out in scenarios you may face. The first of these is max workers, then there's naptime, and then there's the cost limit. Again, the idea is not to go through all of these and tell you to increase this to do X and increase that to do Y; the idea is that when you face an issue, be prepared to dig in and see what parameters are available for configuration. We never have the ability to forecast all of them, but knowing where the different settings live goes a long way. So there's max workers, naptime, and the cost limit, and there's also maintenance work memory, which helps with a lot of the cleanup processing. The principle we learned: when you face a problem, you don't vacuum less, you vacuum more — if you want to keep up, you vacuum more, not less. In terms of general query performance, your day-to-day online reporting and read/write performance, there's a whole set of parameters that matter: work memory, checkpointing, and shared buffers. I think there were also past issues with Postgres causing a lot of write amplification and touching disk heavily; checkpointing does a lot of writing to disk and reading back, so again, these are things to know about and know how to handle. Then monitoring: this being Rootconf, there have been a lot of talks on monitoring tools, performance monitoring, and application monitoring. The idea to keep in mind is that whatever you get out of the box might be useful, but based on your application it's also worth looking at context-specific numbers. Transaction IDs were something we realized we had to keep track of; if we had tracked them over time, that would have been the essential signal alerting us to all of this. So we started recording transaction IDs and watching how the number moves, whether the burn rate goes up under high load, essentially to prevent the whole scenario from unfolding again.
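The check we put on a schedule was something along these lines; the 2 billion figure is the hard limit we were counting down against, and the alerting around the query was our own.

```sql
-- Oldest transaction ID age across the cluster, and the rough headroom
-- left before the ~2 billion wraparound stop; we recorded this hourly.
SELECT max(age(datfrozenxid))              AS oldest_xid_age,
       2000000000 - max(age(datfrozenxid)) AS approx_headroom
FROM pg_database;
```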
Long-lived transactions: this is a problem which happens frequently across databases and across applications. If a transaction stays open, the state of the rows it can see has to be held onto for its benefit, and that is going to cause problems down the line eventually. So the idea is to keep transactions short and quick and make sure that whatever changes you make get committed to disk, to a permanent state, where possible. Our takeaway was to monitor our long-running transactions and make sure they don't have too much of an adverse effect on the system. IO latency ties in with this as well — with long-running transactions and with finishing the operation and flushing it to disk — so IO latency was a number we started looking at more seriously, to make sure our disks' read and write speeds weren't being hurt by holding a lot of extra information. And of course, as I mentioned, the instance itself was originally launched in early 2015, and at that point we were on 9.4. Since then there's been 9.5, 9.6, and 10 is coming out later this year, and there have been a lot of good updates that help across the whole spectrum. If you are able to upgrade to a new version, go for it; make use of the optimizations. Almost every team keeps improving their own product, whether it's this database or some monitoring tool, and a lot of features come out in each new release; as a company you want to be able to use them rather than being stuck on legacy systems with migration remaining a permanent pain. So my advice: if you can upgrade, always go for it, troublesome as it might be, hopefully without downtime. Leaving aside all the other features that were introduced — JSONB support, parallelism, improvements to indexes and sorting — and looking just at our own use case: parallel vacuum would have helped, where you can run a vacuumdb command with, say, 5 or 6 workers and it processes your whole database in much less time; that's available in 9.5 and above. There was pg_visibility, which helps with examining the visibility map for those who want to monitor its status, and along with it an extra all-frozen bit in the visibility map which helped a lot with the cleanup work: it marks pages as not needing an anti-wraparound vacuum, so the full table scans can essentially be skipped. There were also improvements in monitoring with pg_stat_activity; this is very similar to the monitoring which I think Kolm was talking about in MySQL, where pg_stat_activity shows you the currently running queries and lets you see how long each query has been going and how much memory and time it's taking. It's not new in 9.6, but it was improved in 9.6 to give you a better picture of all your queries. So yep, those are just a few of the upgrades, and that was my talk — I guess I finished slightly early. Adventures in Postgres management; I work for Semantics3, and all the pictures were from a friend of mine, EB, so if you're interested in them, they're right there. Thank you. [Audience] Why weren't you able to drop that materialized view?
We did drop the materialized view. What happened was that we had the setup, and the materialized view was only acting as a cache over the data. We dropped the materialized view, so it was no longer present, but when we went back into the pg_database system table, the corresponding TOAST system table had not been dropped, and we cannot drop a TOAST table ourselves. So you can drop the materialized view, but the TOAST table which had the big age remained on the record, and that was preventing the system from recognizing that the data was no longer relevant. Thank you, Ramana. Okay, great, thanks.