So, a little bit of background so it's obvious why this matters so much to us. For the past few years I've been working on the infrastructure at GoCardless. GoCardless is a payments company: we provide an API over HTTP that merchants integrate with, and through it they collect money from their customers. We move that money across, and everyone is generally happy about their lives. Because of what we do, we place a high value on every request. It's not something like a social media post, where if one fails it's not the end of the world. We're talking about people's money here, so people care a great deal about whether things are working properly. It can cause our merchants to lose customers if we fail while those customers are in the middle of checking out. And when you integrate against an API like ours, you expect it to just be there and work. Like, super boring, plumbing-level stuff.

So for us, uptime is key. If we go down too often, our merchants lose money, and if we do that, then we lose merchants. You're probably going to have different concerns to us. Everyone in this room works at a different company, with different constraints and different things that their customers expect of them. But I think there are some common ideas that we can extract from the times we failed to be up and available that probably apply to a lot of people's different situations.

Whenever we have an incident like that at GoCardless, we write it up: what happened, what we learned, and what we're going to change so it doesn't happen again. If you're not familiar with postmortems, I recommend Etsy's article on this, and I will be posting all the slides so you can grab the links later, but that's a really good description of how they run postmortems at their company. All of it boils down to this: we believe that failure can be a great teacher, and you're a fool if you throw a potential lesson like that away.
So I'm going to give you a best-of of our incidents, the five best downtimes we've had. They break into three categories. The first couple are about running out of a resource on a box: CPU, memory, disk, whatever. The third incident is about weird interactions between your ORM or database driver and the database itself. And the last two are around running clusters of highly available databases. For each of those incidents, I'm going to talk about what happened, pull out something we learned, and try to give you some ideas that you can take away and use. I'll summarise at the end with three things that I think you should definitely do, there'll be a little bit of time for questions, and then I'll be around in the hall for lunch as well.

So, let's get started by talking about running out of resources. Incident number one: the fast migration that wasn't fast. This was the first time that I caused a complete outage of GoCardless in production, so it stuck in my mind pretty well. For a little bit of background, we make roughly 10 to 20 schema changes per month. Sometimes this dips down to five or so, and I don't think it's ever gone much beyond 20. In this I'm including everything from creating a new table to back a new feature like subscriptions or plans, adding a column to a table for similar reasons, or adding an index to an existing column to fix a performance issue that you've hit. It fluctuates, but it usually falls into that range.

The problem with schema changes is that they typically hold exclusive locks. They need to modify the actual structure of the database, and they need nothing else touching those structures while they do it. The issue is that if you have exclusive locks and they're long-held, nothing else can use your database. There are plenty of guides on the internet on how to not put yourself into that situation, how to make sure your database schema migrations are fast. There are three bits that come up all the time. The first: don't add columns with defaults. If you add a column with a default on a large table, Postgres is going to sit there holding an exclusive lock on the table while it rewrites the entire thing with the new column. That's bad if you've got millions or billions of entries in a table, because nothing else can touch that table for that time. The second thing is to validate any constraints you add separately, after adding them. In Postgres 9.4 and above, that validation step doesn't hold an exclusive lock on the table while it runs; it holds a much weaker one, so it's generally something that you want to do. Lastly, add any indexes concurrently. What this means is that Postgres will hold a weaker lock while it builds the index, so the rest of your traffic can continue normally.

Following these rules, this migration seems fine. We want to add a refunded boolean onto payments so that our merchants can issue refunds and send the money back to the customer. It seems pretty easy. We're adding a column, it's not got a default, so this can't cause any kind of problem, right? Well, as I learned on that day, locks get queued if they can't be acquired. It turns out that when a lock can't be granted, the query enters a queue to acquire that lock, and anything that would also conflict with that lock queues up behind it. So, let's look at some sample queries. Let's say we have a query that selects all the customers that have ever made a payment through GoCardless. This is run-of-the-mill analytics: you want to see maybe how many customers you have, who they are, who the customers who sign up but never make a payment are. That kind of analysis. You then deploy your new feature, which is refunds, but it has to wait behind the analytics query, which is taking quite a while to run because it's a select distinct. Which means your regular API queries, such as viewing a single payment, no longer work: they have to queue up behind the thing that's trying to grab the exclusive lock on the table.
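Roughly, the pile-up looks like this, with the table and column from our example and a made-up payment ID:

    -- Session 1: the long-running analytics query, holding an ACCESS SHARE lock on payments
    SELECT DISTINCT customer_id FROM payments;

    -- Session 2: the "fast" migration; it needs an ACCESS EXCLUSIVE lock,
    -- so it queues up behind session 1
    ALTER TABLE payments ADD COLUMN refunded boolean;

    -- Session 3: an ordinary API query; it only needs an ACCESS SHARE lock,
    -- but it still queues behind the waiting ALTER TABLE
    SELECT * FROM payments WHERE id = 123;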
What that means is that your API is down. In this case, we had a total outage of about 15 seconds, where critical parts of our API, payments, were no longer available.

So, what was going on here? What was the core of the problem? The issue is that we were mixing two different loads on our database. The first one, that select distinct customers query, is something called online analytical processing, OLAP. This refers to when you're performing aggregation over larger datasets. Generally, these queries are slower, they operate over a ton of data, and they look a bit different to your regular API queries. Our regular API queries are generally in the category of online transaction processing, OLTP. They operate over a small number of rows, and they don't try to perform aggregation functions over the whole dataset. Something we decided after this was: don't OLAP where you OLTP. Move that OLAP load away from your primary database server that's serving API traffic. There's actually very little reason to do that sort of analysis there. You can easily do it on a replica, or you can move it out to some data warehouse, or Google BigQuery, or whatever if you're feeling fancy. It doesn't need to be load on your primary database server.

We also learned that it's always wise to set appropriate bounds on your system. Now, this is a slide that we're going to get super familiar with by the end of the talk; pretty much every postmortem that I'm going to run through runs into something like this. It is better for you to set a bound on some part of the system and have that part fail in an expected way, than to set no bounds and have the system still fail, in a way that you don't understand, where the whole thing falls over. In this case, we started setting these two things on all of our schema migrations: lock_timeout and statement_timeout. What this means is that if a schema migration is waiting to acquire a lock, as we saw before, it will give up after, in our case, 750 milliseconds, meaning that all the regular API traffic behind it in the queue can go through and there's no problem. Sure, your schema migration failed, but it is less critical for you to deploy a new version of your software than it is for your API to stay up. You can get your developers to deploy a bit later in the day or whatever, but fundamentally, you want your system to be there rather than your new version to be out.

We went into way more detail in a blog post on this, so again, I'll be posting the slides later. If you're really interested in this one, we've got a full write-up of it. We've also open-sourced the little piece of code that inserts those timeouts around every schema migration. It drops straight in if you're using Ruby on Rails, but you can probably rip it off in any framework you're using, because it's like 100 lines of code, tops.
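The core of it looks something like this; the 750 milliseconds is the lock timeout I mentioned, and the statement timeout value here is just illustrative:

    -- Give up quickly if we can't acquire the lock, so queued API traffic can proceed
    SET lock_timeout = '750ms';
    -- Also cap how long any single migration statement may run (illustrative value)
    SET statement_timeout = '3s';

    ALTER TABLE payments ADD COLUMN refunded boolean;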
We also have a wish list. What we do there is set timeouts on lock acquisition and at a statement level, but locks actually pile up if you begin a transaction: every statement you run inside a transaction keeps the locks it was granted. What we'd really like is a way of saying, set a timeout on this entire transaction, because what could happen is someone does begin transaction, then runs a schema change that's quick, then runs something that's really slow, and the exclusive lock from the schema change is still there, blocking queries. There's something that looked a little bit like this in 9.6, but it's only for idle connections, which isn't exactly what we want here. Also, turn this setting on: log_lock_waits. What this gives you is visibility into any time your database is waiting on a lock, so you can get a general impression of how contended your database is.

Cool. Incident number two: did we just use one and a half terabytes of disk? Answers on a postcard. This one started off with an alert that really got our attention. For background, we do not have big data. Generally, as a payments company, your records are fairly small. You're not trawling the web for everything and building an index of it; you've got records of payments that happened. So, until you hit mega, mega scale, you won't have anything that I would actually classify as big data. To be clear about that, our data is measured in gigabytes, not terabytes.

Previously to this incident, we had spent about two hours chasing what looked like totally random timeouts between our services, and we couldn't connect a pattern to them. We were starting to pull our hair out, and then we got an alert saying, hey, your Postgres box has gone over 1.5 terabytes of disk. When we looked at the graph, we saw this. At some point in the day, something had started chewing up disk on Postgres like nothing else. This is a really horrible alert to receive, because somewhere a little bit up there, beyond where the graph ends, is how much disk you actually have, and we don't want to hit that. So, something wrong had happened earlier in the day to cause us to start chewing up disk. When we went back to the log files from Postgres, we saw some lines that we'd missed. Postgres was creating a ton of temporary files for some reason that we didn't know. Now, the reason we'd missed these is that we were looking through the logs for stuff we already knew about. We were looking for slow queries and things like that, missing what the actual problem was, because we didn't look at the logs as a whole.

So, what was it this time? It was a runaway analytics query again. In this case, someone was trying to add a breakdown of how merchants were using our API, so which different features they were using the most, and it would generate an email to send to developers so they could look at it whenever they fancied. The problem is that it joined across many, many tables to do so. It was going through everything from the merchant, through their payments, through their agreements with customers, through absolutely everything. The result of that join, materialised into memory, was massive, and so Postgres started spilling it to disk in order to produce the result. So, once again, don't OLAP where you OLTP. We were bitten by basically the same problem, and after this we really started to shift more and more of our workload off to a data warehouse instead.
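By the way, both of those problems announce themselves in the Postgres logs if you ask. A rough sketch: log_lock_waits is the setting I mentioned a minute ago, and log_temp_files isn't one I've covered, but it's the standard knob for getting temporary file creation into the logs.

    -- Log whenever a query has been waiting on a lock for longer than deadlock_timeout
    ALTER SYSTEM SET log_lock_waits = on;
    -- Log every temporary file as it's cleaned up, along with its size (0 = log them all)
    ALTER SYSTEM SET log_temp_files = 0;
    -- Both settings can be picked up with a config reload
    SELECT pg_reload_conf();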
So, again, we saw this problem of queries operating over large datasets interacting poorly with queries operating over small datasets. We were lucky in that this was a near miss: we managed to cancel the query before it hit the line of doom at the top of that graph, and then we got some time to really think about this and come up with something that would stop it happening again. Once again, we set appropriate bounds on the system. In this case, a setting called temp_file_limit, which is the maximum size, in kilobytes, of temporary files created per Postgres session or connection. What this does is shield your service from rogue queries. It means that no one connection can just sit there and fill up the entire disk of your Postgres server, which is a good thing.

But there is a caveat that comes with this. Generally speaking, your data is going to grow over time. I don't know of companies where the opposite is true; please tell me if you know of one, because I'd be kind of curious. The bits of your system that are liable to break when you set this are the things that aggregate large amounts of data together. In our case, this was our payment submission run, which happens at some point during the day and has to aggregate everything that wants to be submitted at once. Again, that's kind of an OLAP thing, but because we want to do it over fairly live data, like payments that have actually been created today, we couldn't do it off a data warehouse. This was something that broke for us when we added our temp file limit. We had to bump the limit accordingly and optimise some of those queries to not need temp files. So be a little bit wary when turning this on: some stuff is going to break, and some stuff may break down the line. What I would recommend is instrumenting your server for how much temp file space you're using, getting that into graphs, and setting an alerting threshold a bit below whatever your temp file limit is, and then you will know before this becomes a problem. Cool. You want to measure these things as you approach your limits. If you're setting a limit, you probably want an alarm at 90% of that limit or whatever.

There's one more caveat: index creation uses temporary files. Whatever you set as your temp file limit is instantly the largest index you can create, which is frustrating, because you may have some fairly large tables. The way around this is to disable the temp file limit during your schema migrations. Much like we wrapped timeouts around our schema migrations, you can set temp_file_limit to minus one to disable it, do your migration, and then re-enable it. This requires superuser, so you're going to need privileged connections that your schema migrations run through just for this purpose.
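Roughly, the bound plus the migration-time escape hatch look like this; the 10 gigabyte cap and the index here are just illustrative, and the session-level statements need to run on one of those superuser connections.

    -- Cap the temporary file space any one session can use (value is illustrative)
    ALTER SYSTEM SET temp_file_limit = '10GB';
    SELECT pg_reload_conf();

    -- During a schema migration that needs big temp files, e.g. building a large index,
    -- lift the cap for this session only, then put it back
    SET temp_file_limit = -1;
    CREATE INDEX CONCURRENTLY index_payments_on_created_at ON payments (created_at);
    RESET temp_file_limit;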
Cool. Our first two examples have focused mostly on Postgres behaviours. I'd like to move on to those interactions between ORMs, database drivers, and Postgres itself. While this may feel like a bit of a detour at what is primarily a databases conference, I think it's important to remember that our users care about the combined behaviour of the database and the app. Postgres might be singing along happily in the background, but if the app can't use it, users don't really care what's broken; they just know that your service is down.

So, incident three: revenge of the ORM. Let's start by starting a company. Say you've chosen Postgres as your main data store, and you have a couple of clients, and they're like a Rails app or whatever. You don't have mega scale, so you don't need many clients. Then your startup starts to have a little bit of success, and you have to add more clients. As this goes on, you realise that because Postgres uses a process per connection, this is quite expensive in terms of memory. So you want to get better utilisation out of your Postgres box, or otherwise you're going to have to provision one with, like, a lot of memory. The typical solution to this is to deploy a proxy called PgBouncer in transaction pooling mode. The way that mode works is that it takes many client connections and multiplexes the queries coming over them down into a smaller number of connections into Postgres. The guarantee it provides is that if you begin a transaction, it will stick you to a back-end connection and not switch you between them; otherwise transactions would be completely broken. For everything else, though, there's no guarantee of which back-end connection you're going to hit, which means you lose access to session-level features. There are bits of Postgres you have to stop using. The three main ones are: setting session-level variables (setting them at the transaction level is fine); session-level advisory locks, if they're something you use; and prepared statements. You have to stop using prepared statements from your app. Now, you can still use bind parameters, but you lose the advantages of caching of query plans and stuff like that.
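To make that concrete, here's roughly what those unsafe statements and their transaction-scoped stand-ins look like; the values and names are made up:

    -- Session-level state: the next statement may land on a different backend,
    -- so this setting leaks onto whichever client uses that backend next
    SET statement_timeout = '5s';

    -- Transaction-scoped alternatives are fine, because a transaction stays pinned to one backend
    BEGIN;
    SET LOCAL statement_timeout = '5s';
    SELECT pg_advisory_xact_lock(42);   -- instead of the session-level pg_advisory_lock(42)
    COMMIT;

    -- Prepared statements are per-session too: the PREPARE and the EXECUTE
    -- may hit different backends, and then the EXECUTE fails
    PREPARE get_payment (int) AS SELECT * FROM payments WHERE id = $1;
    EXECUTE get_payment(123);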
So we went through and prepared our apps for this, made sure we weren't using any of those features, and in the places where we had to, for example when setting those timeouts on lock acquisition, we ran them through dedicated connections to Postgres, because PgBouncer's transaction pooling isn't compatible with them. And we thought we were good. We switched over to transaction pooling, because we were starting to run into a bit of a scalability wall with our current Postgres setup. And then, the next day, we started seeing this in our exception tracker. If you don't speak Ruby, this is a nil pointer exception: it's saying, I just tried to call the method fields, and it was on a nil object. Now, this was coming from deep down inside the database driver, pg, which binds onto libpq, so something really funky was going on. If you Google for this error message, the entire internet tells you to disable prepared statements. Well, we'd been through all of our apps and already done this. We went through again and double and triple checked; we were definitely not using prepared statements any more. So this one was going to be down to us.

We ended up sinking three days of three people's time into fixing this problem. That may seem like a lot, but we basically had no choice: we were either going to enable this feature, or run into a scaling wall and start having to spend inordinate amounts of money adding more and more RAM to our Postgres servers. So we had to find out what was going on here; it wasn't an option just to punt this to the backlog. Digging further, we noticed that this was happening more when something violated a constraint, such as a unique index or anything of that kind. We finally started to lose our minds, and we decided that we would locally boot up Postgres and our application stack while running tcpdump and looking at it through Wireshark. And while we didn't strictly need those, we got our answer: there were a bunch of extra statements going down the connection that we weren't sending.

This was what was going on, and those statements were running just after the connection was established, before we actually sent any of our app queries. Those middle three really stand out. For some reason, it's changing the level of messages the client receives to panic, the highest level. Then it's trying to enable standard_conforming_strings, which, if you're not familiar with it, that's because it's not really a thing any more, it's enabled by default now: there are standard-conforming and non-standard-conforming strings, and you generally want strings that conform with the standard, because that's good. So Rails is trying to enable that. And then it sets the message level back down to warning. So it's really weird that we're seeing these there. The reason Rails does those three statements is that standard_conforming_strings is only something you can set on or off in Postgres 8.2 or above, and Rails has this to stay compatible with 8.1 and below. So it was saying: don't tell me about any errors; try this thing that might cause an error; okay, now tell me about errors again.

Let's see how that interacts with PgBouncer. Let's go back to our diagram and simplify down to a single client, because that's all we need to trigger this. We're going to set that level to panic and then down to warning. The first statement goes down back-end connection number one. The second statement goes down back-end connection number two. Because we're not inside a transaction, there is no guarantee that you're going to stick to an individual back-end. So what happens if you leave client_min_messages set to panic? It turns out that a whole bunch of useful stuff that you want to hear about, like constraint violations, is no longer reported to the client. What that means is that the client connection is left in a bad state, and the next thing that comes along is going to error out in a really super weird way, like that nil error we saw before.
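For reference, the statements in question look roughly like this; this is a reconstruction from memory rather than a verbatim capture of the wire traffic:

    -- What Rails was sending on connect, as described above
    SET client_min_messages TO 'panic';
    SET standard_conforming_strings = on;
    SET client_min_messages TO 'warning';

    -- Under transaction pooling these three aren't wrapped in a transaction, so each one
    -- can land on a different backend. Any backend left at 'panic' then stops reporting
    -- things like constraint violations to whichever client picks it up next.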
So what was the fix? It turns out that, by their own policy, Rails didn't even care about this any more. They were not maintaining compatibility with Postgres 8.1, because it was 2016 and that was not a thing they had to care about. So we removed this code from Rails, they were happy about that, and the bug stopped happening. So that was all cool. We also wrote a detailed write-up of this, so again, if you want to read it in full depth, with links off to all the different bits of stuff we had to look into, go right ahead. Something that was really cool was that Rick Branson, who used to work at Instagram, had seen the same thing with Django and PgBouncer. Django was also issuing session-level statements, and when they switched to PgBouncer with transaction pooling, they ran into very similar problems. So it's not just the Rails ORM that has this problem; it's fairly widespread.

So, what did we learn? There's a lot more to running a database reliably than /usr/bin/postgres. Our stack is just Rails and Postgres, but in debugging this we had to understand what PgBouncer was doing, we had to dig through libpq, we dug through Ruby's pg gem, which is the binding onto libpq, and we dug through Rails' ORM, Active Record. If you're going to run Postgres and understand how it's behaving in production, you've got to get comfortable with other people's code. You want to remember that it's just code, it's not magic, especially when things are open source. This is a really easy thing to do: you can go to GitHub or wherever, pull the code, and start pulling it apart. If you're really stuck, you can probably ask the author for some help; they will generally explain things to you. I've had a lot of success with that.

Also, don't trust the ORM. The great thing about frameworks is that they do so much for you. The awful thing about frameworks is that they do so much for you. There's never a good surprise when you find out your framework is doing something; you usually mean, I was really surprised by that behaviour, in a bad way. That's all I'm going to say about Rails today, and ORMs in general.

The final section is about running highly available databases. Everything I've talked about before this applies to a single Postgres node. If you care about availability, you do not run a single node of anything, because that's kind of a bad time. In this case, I'm going to talk about two incidents where, instead of providing higher availability, our cluster actually reduced our availability. Generally, running highly available clusters is about trying to answer these two questions: how soon are we back online, and is all our data there? You want to be able to answer "soon" and "yes". Cluster is a super opaque word; it doesn't really tell you anything about what's going on. I'm going to boil it down to four pieces that I think are essential to clustering. The first is consensus: some way for the nodes to agree on the state of the world. The second is a state machine: some way for each node to model the state of the world outside of it; the nodes agree on that state by using the consensus mechanism. Then health checks: we need to be able to understand whether the other nodes are up, and agree on whether one of them should be primary right now. Lastly, you need some sort of scripted actions that can be triggered whenever you have a state change. Say I have a state change from healthy to unavailable for a node, and that node happens to be primary: I want to run the scripted action that promotes a new node. I'll say this is a mostly true simplification. Clustering is way more complex than I can get through in those four points, but it serves for what I'm going to talk about. For clustering, we use a tool called Pacemaker. It's kind of a generic clustering tool that you can plug different scripts into to make it compatible with different bits of software. We use the out-of-the-box pgsql script, but there are also scripts for MySQL and a whole bunch of other tools.

Incident 4: sorry, I can't hear you, I'm kind of busy. This one starts with us getting paged. Our API had gone down and we'd been alerted to that. When we looked at the rest of our alerts and monitoring, we saw this. This is basically saying that the cluster had decided to demote the current primary and move it to a new node. During that time, while we're moving to a new primary node, our API is unavailable, because we can't access any data. The reason it had decided the current primary was unhealthy was that it had replied, sorry, I've got too many clients connecting to me already. Now, this was kind of surprising to us, because we thought we'd provisioned the right number of connections everywhere so that we couldn't run into situations like this, especially not for something like a cluster health check, which should be running through a privileged connection so it doesn't get affected by applications. We had about 30 seconds of downtime while the primary was moved over to a new node, but then everything came back, so we were in an okay state by the time we were actually reacting to stuff.
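As an aside, when Postgres starts telling you "sorry, too many clients already", a quick way to see where the connections are actually going is something like this:

    SHOW max_connections;
    SHOW superuser_reserved_connections;

    -- How many connections are open right now, and which users own them
    SELECT count(*) FROM pg_stat_activity;
    SELECT usename, count(*)
    FROM pg_stat_activity
    GROUP BY usename
    ORDER BY count(*) DESC;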
In order to stop this from happening again, we rolled back a change that we had just deployed to the infrastructure. We'd provisioned a bunch of new asynchronous workers onto a new app infrastructure that we were starting to use, so we figured we'd probably got something wrong there. Spin those down, because it's async work and it doesn't really matter if you're not processing it quickly enough right now, and we started to investigate from there. The cause turned out to be really obvious, just a dumb mistake. I mentioned earlier that some of our things need a way of connecting directly to Postgres and not through PgBouncer, and we have a flag that says use a direct Postgres connection, true or false. We had accidentally provisioned all of these asynchronous background workers to go directly to Postgres, and Postgres was mighty unhappy about this, to say the least.

As I mentioned, the health check should go through a privileged connection, so you'll be saying, what about your superuser reserved connections, why didn't that save you? Let's run with an example of 100 connections, just because it's a nice round number. We need four superuser connections to run our things: a couple of replication connections, one for health checks, one for backups. Let's say the apps are using 70 connections. There are 26 free in the pool, and we've got those four superuser processes spread across three reserved connections and one non-reserved connection. The reason for that is that three is the default number of reserved connections for superusers, and we hadn't changed it. Whoops! So under load, once we told everything to go directly to Postgres, you hit 97 connections from the apps, your three reserved connections get used, and then, unfortunately, the health check gets rejected because there are no connections left.

So, the same lessons again: set appropriate bounds on your system. In this case, rather than having to limit a resource, we hadn't thought ahead enough to reserve a resource for an important process. Now, I think the only real way to foresee situations like this is to properly comb through the config files of anything you're going to deploy. Really, this happened because we'd gone, "this default is probably fine", rather than thinking about what all our really important superuser-privileged processes are and whether we've set aside enough connections for them. Planning pays off. Cool.
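The shape of the fix, using the numbers from that worked example rather than our real configuration, is just to reserve enough headroom for every privileged process you actually run:

    -- With 100 connections in total, keep back comfortably more than the four
    -- superuser connections the example needs (numbers are illustrative)
    ALTER SYSTEM SET max_connections = 100;
    ALTER SYSTEM SET superuser_reserved_connections = 6;
    -- Both of these only take effect after a restart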
So, the next incident is kind of intense, so I'm just going to give you a government-mandated break and take a sip of water. Cool. Incident 5: what's in a health check? Our second clustering incident is much more complex than the previous one, and it was way more difficult to debug. This time there were plenty of connections available, and to figure it out we're going to have to look deep into what goes on during those health checks. If we massively simplify how the health checks run, the way Pacemaker does it is that it uses su to change down to the postgres user on each node and runs some sort of query against each node individually. It then aggregates those results and decides what to do. In reality it runs a bunch of queries that check things like, is this node the primary or a replica, and individual stuff about replication state, but we can reproduce this bug by just doing a select now() on all of our nodes as the health check.

So, we have a rule at GoCardless, which is that for user accounts we put humans into LDAP, because it's generally easier to manage than having to reprovision everything onto your boxes, but robot accounts, such as postgres, live locally on each machine. The reason we do that is that we never want a downtime in LDAP to turn into a downtime of gocardless.com; those two things should be separate failure domains. Following that rule, postgres is a robot account, so you never need to reach out to LDAP to resolve that account. So it should be fairly safe to just restart your LDAP nodes at will, right? Your users won't be able to log into anything for a bit, but you know, it's fine, the API is going to stay up.

We were rolling out some upgrades to LDAP, and during that we saw our Postgres cluster hit a timeout on its health check, and when it hit the timeout it decided to transition to a new primary node, despite Postgres being healthy. The way that health check command works is that the Pacemaker daemon runs it and su's down from root into postgres. Now, despite everything I've just told you, when LDAP became unavailable, this command blocked. And it turns out the command itself didn't matter: you can su down into postgres and echo hello, and it's still going to block.

So what was going on? We turned to some of my favourite debugging tools, strace and ltrace. The way strace works is that it attaches to a process and prints every system call that process makes; ltrace is the same, but for C library calls. This may be a little bit controversial, but I think these are probably the two quickest and best tools you can learn for your debugging toolkit. The reason I think that is that they're extremely general: they will follow you to any programming language you use, they're totally agnostic of what you're writing, but they tell you a lot about what your program is doing. If you're not that familiar with them, I think a cool starting point is Julia Evans' articles online; the one I've linked to is a zine that she drew about how to use strace and why it's cool.

So we ran them, looking specifically for anything to do with user and login calls, and these ones stood out: getuid returning 0, which is root, and getlogin returning chris, which is me. That seems really unusual, because I'm not the Pacemaker cluster, so that's not okay. So what does getlogin do? Let's look at the man page. According to the man page, getlogin returns the user name associated by the login activity with the controlling terminal of the current process, which is super clear, and we all know what that means. The two interesting bits of information there are the user name, so it's a username string rather than a UID, and this notion of login activity. Login activity is something that follows whoever initially started a session down through any sub-processes they started, and to show this I wrote a tiny C program which printed out the result of getlogin. When I run it as me, it returns chris; when I sudo up and then run it again, it still returns chris, because that's followed me through the session. Linux stores this in two places, /var/run/utmp and /proc/<pid>/loginuid, so let's have a look at what that is currently set to for our Pacemaker cluster.
So we find the process ID of pacemakerd, we look at its loginuid, and instantly that does not look like root. Predictably, we run getent on it, and it's me. Like, what? Seriously, it's me. This is deeply worrying. It turns out that if you log into your system, and at least this holds true for Ubuntu and upstart, and you restart a service such as Pacemaker, your user becomes permanently associated with that service until it is next started, because of this whole login activity thing.

So what is su actually doing here, and why does it use getlogin? It turns out the code is kind of subtle. What it's doing is trying to get the passwd struct for the calling user. The reason it needs the passwd struct is that it needs to, one, allow root to su down to any other account, and two, authenticate you if you're not root, which is where you actually supply a password or whatever. So let's step through this in pseudocode. I've replaced the calls with some things that I think are slightly more readable, because not everyone's going to know the Unix login stack that well. First off, su gets the UID of the currently running user, so it's zero, it's root. getlogin returns me, which is frustrating. It then calls a method which gives you the passwd struct by name, so it's passing chris into that. Then it compares the UID in that passwd struct with the UID it got from getuid, which is not the same. Now, that step just before, the passwd-from-name lookup, is where we blocked earlier. And I can't really see when this conditional is going to be true, particularly if I'm the one who's associated with the login. So what it ends up doing is calling a method that gives you the passwd struct by UID, which is zero, for root. So what situation is the first side of that conditional useful in? Only if you have duplicate UIDs is this ever actually going to return you a different passwd struct, which to me feels like a really weird way to set up a system. I've never in my life done that, and I generally don't like things that collide on IDs; it feels weird. And this whole notion of login activity doesn't feel like what su actually wants in reality, so I'm not sure why it's using it; it feels very disconnected from the thing that is actually controlling the process.

Last time I gave this talk I had to wave my hands here and say, maybe there are reasons, and then peace out and leave the room. This time I got to do some more digging. It turns out you can trace the lineage of this code in su back to FreeBSD, to at least 1994. The reason I can't go beyond 1994 is that that code comes from an import from SCCS, which I do not have on my laptop and do not know how to use, so I would have to go down into the BSD 4.4 repos or earlier to actually track down where this code was introduced. So I went to the freebsd-hackers mailing list, and they were super helpful; I got a ton of responses on the same day I asked the question. It turns out there are reasons, and we have some answers. The first one, which I didn't hit on Linux but makes a ton of sense in FreeBSD, is that they have an account called toor, which is root spelled backwards. The reason they have that account is that lots of people want to customise things like the shell that gets used, but they generally favour not doing that to the actual root account, in case you introduce a broken shell or something, because you might totally lock yourself out of the system at that point. So toor is also UID 0, but you can change its entry in /etc/passwd to say, use my fancy zsh or whatever you want.
It turns out other people are doing the same thing, or similar. One person responded to say that they run a bunch of mailing lists and use different Unix mail accounts for each user. They didn't want the faff of setting up file permissions or ACLs or a shared group or whatever, so they just use the same UID all over, which is kind of fair: if this is an expected behaviour of the system, feel free to use it. I still find the whole duplicate UIDs thing a little bit weird. Lastly, someone linked me to this really cool repo that's been put together, which details the history of Unix. I've not had a chance to go looking through it yet, but someone's gone all the way through the old, old, old Unixes and put together the history going forward to FreeBSD as of today. I really enjoy geeking out on what I like to call code archaeology. I think it's super cool being able to trace what we have now back to the 1970s, and it can also be useful in answering questions like the ones we hit in these incidents. A bit of me wonders if, in 50 or 100 years, this will literally be a job title.

So, we needed to fix our setup, and one more time the answer is to set appropriate bounds on the system. In this case, we set an LDAP timeout which was much smaller than our health check timeout, which meant that if it couldn't reach LDAP it would just fall back straight to the local user, everything would be fine, and we wouldn't falsely fail over the cluster. We saw again that there's a lot more to our stack than /usr/bin/postgres. In this case we had to understand the entire Unix login stack, including LDAP, so you've got this weird distributed-system behaviour inside a distributed system, and it all combines to take you down.

Another cool thing, I think, is running game days. If you're not familiar with game days, they're basically trying to answer questions like: what happens if I turn this service off? Running actual game days, where you go and switch things off, will help you surface problems that you basically won't find in any other way. You have to run them on some sort of representative infrastructure. That can be production, but generally you can start with staging environments, and then eventually you can go full Netflix mode and use Chaos Monkey and things like that to start taking things out in production as well. Again, we published a blog post about this very recently, which details exactly how we run game days at GoCardless, and hopefully it's useful for anyone who wants to run similar exercises.

So that's all our postmortems. I promised I'd give you three bits of advice that you could take away and do something with. The first one is: check your logs. There was a ton of interesting context in a lot of our incidents which we would have got if we'd taken a proper look at our logs rather than just passing them by. You probably have too much stuff in there to extract the signal on a day-to-day basis, but there are tools that will help you, such as pgBadger, and aggregation tools like Kibana or Elasticsearch. The next one is: set appropriate bounds on your system. When you're configuring something new, go through its config file and look for anything where it seems like you may want to set some sort of limit; it's just generally a good thing to do. Lastly, do not trust your ORMs and database drivers.
Go ahead and fire them up locally in development against a Postgres database and just see what they do, because maybe they will surprise you like Django and Rails did. You'll notice that none of these are about clustering. I think it's hard to give generalised advice about running HA clusters, at least in a little wrap-up like this, and frankly a lot of people now use a hosted service for clustering.

Lastly, I've got a couple of thoughts just to close up. Operating a database is about much more than the database itself. We saw this slide a couple of times, and I think it's so true: you never want to think of Postgres just by itself, you want to think about it in the context of the distributed system you're actually running. And finally, this is one of my favourite tweets ever. It's from a guy called Scott Andreas, and he's saying that not once has he regretted spending unbounded amounts of time investigating something fishy. I think if you care about the reliability of your systems, you generally want to have this attitude: you find something fishy and you dig into it. You will learn things, you will improve your knowledge, and you will improve the reliability of your systems. Thank you.

So, just before I launch into questions: we are hiring for site reliability engineers, so if you want to do this kind of stuff and much more, and you like building reliable stuff, come chat. But now the real thing: questions.

How many transactions are we talking about here, and was any of it at peak time? So, I can't give exact numbers, sorry. The question was how many transactions are we talking about in these less-than-a-minute outages, or exact numbers on each one. What was the second part of your question, sorry? How many transactions do you process a second? OK, so our transactions-per-second number isn't public, at least as of yet. Oh, sorry, and to the second bit, which was whether this was at peak times: yes, it has been at peak times.

Question at the back. So the question was around whether we've considered using manual failover. In two of the incidents we had auto failover actually taking us down, but there are empirically more situations where the auto failover has saved us, and we've stayed up or been back up more quickly than we would have been if we'd paged an ops person in the middle of the night to do the failover. So we're going to stick with it, but I totally get the point that this is a risky thing, and you generally don't want auto failover to be too aggressive. Something you want to tune is your timeouts on when it considers a node to be unhealthy, for example. So yes, it is a balance, I agree with you, but I don't think going back to manual failovers is suitable for us.

I can't see any more questions, so if you want to chat somewhere more relaxed, I'll be out in the hall anyway. Thank you all very much.