I'm going to introduce us a little bit. As you can see, there are three names on the slide, but only two people here; I'll come to that in a minute. As for myself, I've been doing free software since before it was called open source. The dates on the slide are when I officially started as a developer, not as a user, and I'm still at it, so that makes 23 years now. The idea for this presentation came up when Jim and I talked at SCaLE. I told him about this customer of ours and he said, well, that's interesting, people would like to hear about it. So I figured, okay, let's do it. I submitted the talk to the CFP and then realized, hmm, I don't actually touch the system myself, so as soon as the technical questions get a little deeper into detail, I'm kind of lost. The idea was to bring Bernd, our technical lead on Postgres, because he leads the team working on that customer. Unfortunately, he got sick and couldn't travel, so we had to change plans. He's still listed on the slide, but I brought Julian with me instead, who works on the same team, so he knows the system as well as Bernd does. You can see he hasn't been with us as long as Bernd has, but a lot of his work is on exactly that customer system, which, by the way, I'm not allowed to name. Sorry about that. The next slide is just an introduction to the company; I'm going to skip it completely. Coming back to the original source of that special customer situation: what you see here is admittedly already ten years old, but the numbers still hold true. This is total cost of ownership for running computer systems, and as you can see, the biggest cost is staff. But a lot of people who think about reducing total cost of ownership think about reducing licenses, right? Licenses are not everything, though; as you can see, they are actually a small part. The worst case I've heard of was a company, actually a government unit.
More than 90% of their whole IT budget was already fixed for salaries and maintenance costs; there's not much left. So that brings us to this customer of ours. Just picture a company: nowadays we speak of e-commerce, but back then it was mail order, home shopping, whatever you call it. They started with mail order; nowadays it's online, a TV channel, and so on. They are in a situation where they face competition from traditional retailers, but also from other, larger e-commerce companies. In particular, when you're in an area with small-priced items, there is a lot of cost pressure, pressure to get costs down. And you need a fair bit of performance, right? Because you're trying to sell a lot of stuff. So they came up with the idea of running it all on Power. I actually don't know why, because that was before we started working with them; maybe it's just historical. But anyway, they also came up with the idea of using Postgres. They already had Power, so I assume Power was the historical choice, and, correct me if I'm wrong, they use both operating systems, right? AIX and Linux. Back then it was just AIX. Okay. So the first question is: where do I get my Postgres? It's not just about having it running; I need updates and everything. I don't want to do my configure, make, make install routine every time and build my own team for that. And besides, you might need some professional help with one thing or another. So that's where we came into play, which is exactly where I leave the stage and have him talk to you about the technical details. -- Can you hear me? That seems to work. Hi, so I'm Julian. I think Michael has introduced me fairly well already. So, this customer we're talking about has been around since the early 90s.
And this is one of the reasons Michael doesn't recall what they did when they started, because that was over 20 years ago. Starting in the early 90s, they were selling products to their customers via brochures, teleshopping and so on, and they very quickly had to adapt to e-commerce, because that is exactly the market you want to be in for this kind of business. If you started using IT at a bigger scale in the early 90s, that was like the dark ages of IT; things were horrible back then and you had to use whatever was available. So of course the infrastructure grew and grew until it reached its limits, and eventually they had to change a few parts and adapt. In, I think, 2005 they changed to PostgreSQL and haven't looked back since. And one thing about historically grown systems: they are still using OS/2 eComStation with REXX scripts to connect to the Postgres server. That is so old that most people don't even know it; I only know it because this very customer uses it. But this is, for instance, one thing that makes Postgres great: it is open and easily accessible from anything, regardless of how old the client is, because if something wasn't supported, you could add the support yourself. Anyway, as I said, we started around 2005 and went with POWER5, which was current at the time. Michael already mentioned that we use both AIX and Linux. AIX is only used for the application server; we wouldn't use it for Postgres. Certainly you could, but we as an open source company have far more expertise in Linux, and of course Postgres itself has more exposure on Linux, so it is a very good idea to keep it there, and besides, there are no real downsides. So that is the way we went. Power has quite a few good reasons to be there. One of them is that you may want centralized IT infrastructure.
That means Power has been able, for a long time now, to provide good virtualization: PowerVM with LPARs. You can, for instance, have bigger machines in your data center and use them for different things. That is normal today; it wasn't ten years ago. And besides, virtualization on Power is pretty cheap, so that is one of the good reasons to use it. Talking about Linux, we started with SUSE, SLES 10 actually. It went well, but we eventually switched to RHEL, basically because the support was better for the tools that we needed. I'm not trying to bash SUSE here; their support was great, of course, but it was about the tools we needed, for instance Pacemaker and the whole high-availability cluster stack. So eventually we went to RHEL and haven't regretted it since. -- Oh yes, yes, we'll come to this in a second. All right. So let's get a bit into detail. We've been using Power since POWER5, as I mentioned, always adapting to the new technology, and currently we are on POWER8. A fairly big machine; I don't know if you know Power, their pricing or their models. This is an S824, roughly a hundred cores in our setup and roughly half a terabyte of RAM, two machines, right now running RHEL 7, of course, and Postgres 9.4. I think we migrated earlier this year, so of course you wouldn't want to go with 9.5 yet, and 9.4 is doing a very good job so far. We have two major databases in play: the logistics database and the inventory database. The logistics database is where the customer buys things, right? It stores the products, and when a customer buys something, it is marked and the people in the warehouse get informed. As they walk by, the system tells them to actually put it into the basket, and it may or may not get transported to the post office.
But it should be. Yes, and the machines are not co-located, of course; even more so, the hosts are not co-located either. So we have two big databases running on Power at the moment, and we try to keep them on different sides of our hardware: one on the one machine, one on the other. Although we are virtualized via LPARs, we try not to have everything running on one host. There are always buses that get busy if you have a lot of pressure on one system, so we try to spread the load out as early as possible. This was influenced by the requirement that we need to guarantee response times. I mentioned that the warehouse people walk by and get informed if something in their area has to be picked up. So of course, if the database responds very slowly, they can only work very slowly, right? Because they have to wait at each aisle to see if there is something for them. So yes, Postgres has to be very responsive here. And not just for simple queries: we put big parts of the business logic into the database, running PL/pgSQL functions on various occasions. And it is working absolutely fine right now. The setup, for one of the databases, one of the hosts: it's a primary and a secondary node in an active/passive Pacemaker setup. We have Pacemaker and Corosync doing the cluster management, and one primary Postgres server that accesses the shared storage on the SAN somewhere and runs the database. So Pacemaker can switch the database over to the cold standby whenever it decides it has to, and can easily fail over. As you can see over there, we always try to be able to divide the load between the machines.
So we have these two machines, and the LPAR of the active node is always on the other host than the one of the passive node. We try to enforce that for various reasons, for instance just to keep the average load of each host a little lower. Talking about clustering: for the database we decided it's a great idea to use storage-based death, SBD. This is a poison-pill mechanism, where the nodes always look down onto the storage, and a daemon decides, based on whether it is still connected to this SBD device, what the status is. If it can't reach the storage anymore, the node takes the poison pill, as you would say: the host shuts itself down. This is a great thing for databases, because a database just can't work without its storage, so it is a way to ensure that only one database instance is running. I will come back to this later, though. On storage sizes, we're at roughly half a terabyte per database, and we have about 100 gigabytes of RAM per database. Yes, so this is the basic setup for both of the machines; you have to picture it twice. The image on the far left tries to illustrate this: you have one LPAR that has its standby on the very other machine, and the other way around. We are heavily utilizing streaming replication at this point. From the standby, we take the WAL records and store them on another Power machine, so we can make reports, make backups, make whatever we like. It is currently working very well. We still have room for improvement; we have one streaming standby at the moment, maybe it will be more, we will see. Backups are actually a very good thing to take from a slave, because they put a lot of load on the system, and you just don't want that on the primary. Okay, so expanding these two setups into one image, you can see we are in quite the environment. We have the logistics side: the internal departments, like warehouse, sales and so on, who are accessing all the databases.
We have the teleshopping, which needs real-time information to display on television. We have a backup instance holding, I think, 1.5 terabytes of data at the moment; we always try to export no-longer-needed information out of the live databases. And on top of that, we have a web shop that is currently running on Oracle, although I have to say we will change this soon: it will become an x86 PostgreSQL server, and even more logic will be pushed down into the logistics database. So that database is going to become even more important. Actually, I think this is the last Oracle in play at the moment. All right, talking about Corosync and Pacemaker: we do use RHEL, but we went with our own builds. We took the Pacemaker and Corosync packages from CentOS 7, for various reasons. One of them is history; the second is that we needed to build our own SBD fence agent, and this has actually been merged upstream by now. So if you are using the ClusterLabs fence-agents, they include our SBD agent, and it works very well so far; it is proven, because it is already working in this very setup. For Postgres, we use the PGDG packages, built them ourselves as well, with a few modifications. We build with the POWER instruction set. It is a small difference; it changes things, but it is neither a showstopper nor a big advantage, a few percent, roughly five to ten, depending. And we leave out the LDAP support, for a good reason: you've seen the Oracle just before, and oracle_fdw and the LDAP capabilities of Postgres just don't work together. There are symbols with the same names, because oracle_fdw already brings its own LDAP symbols, since it needs LDAP itself, so it will break if you use both. We don't use LDAP on the Postgres side anyway, so we just make sure it cannot break at that point, which is a pretty good idea for a database in a high-availability setup. We build our own packages, so we have a build infrastructure.
This is our Jenkins, and some of you may know this one, because it is a buildfarm member called Chop, building for Power. You will see it doesn't fail often, but for a buildfarm member I think that is pretty normal. As you see, we migrated from Postgres 9.1 to 9.4; we made the switch eventually, I think earlier this year, having run 9.1 before. Migrations are always expensive and always take time, so we tried to avoid one for a while and waited until 9.4 was mature enough, and then we changed early. Yeah, we build Postgres, and Pacemaker as well, but I think you figured that. Going into configuration: the logistics database, for instance, has 48 cores at the moment. That is half a Power machine; as I said, we have about 100 cores, 96 to be honest, and 48 go into the one logistics database. You will see a few of the performance metrics later on; it is under a bit of load sometimes. Huge pages: I believe huge pages are always a topic if you're dealing with big machines. We use huge pages, but not the transparent ones, for a pretty good reason. Transparent huge pages are not slower, but they are in danger of not working. The non-transparent huge pages you have to allocate at boot time, right? So you are guaranteed that when the system comes up, you get this certain amount of memory reserved for your pages. With transparent huge pages, it may happen, if you allocate often enough in a very fragmented memory, that the kernel can't find a big enough contiguous area of memory to actually use for a huge page, so it will fail. So we went for the non-transparent ones. Huge pages have a performance impact, maybe around 10%, but that's about it, actually. We do more, of course. We use tuned, if you know it, the tuning daemon that tries to optimize the system according to certain profiles.
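To make the boot-time huge page reservation just described concrete, here is a sketch of the two pieces involved on a RHEL 7 / Postgres 9.4 system. The sizes are illustrative, not the customer's actual values; note that POWER8 with the hashed page table MMU uses 16 MB huge pages by default, versus 2 MB on x86:

```
# /etc/sysctl.conf -- reserve huge pages at boot time.
# Illustrative: ~10 GB of shared_buffers / 16 MB per page, plus some headroom.
vm.nr_hugepages = 704

# postgresql.conf -- since 9.4, Postgres can demand huge pages explicitly.
huge_pages = on          # 'on' fails at startup instead of silently falling back
shared_buffers = 10GB
```

With `huge_pages = on`, an insufficient reservation shows up immediately as a startup error rather than as a silent performance regression later.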
With tuned, we went for the throughput-performance profile, which adjusts a few of the schedulers, the I/O scheduler and the CPU scheduler for instance. And we change CPU scheduler settings ourselves as well. You may have noticed this has been discussed frequently on the Postgres mailing lists: it is an optimization that makes the kernel not migrate a process to another CPU before a certain time has passed. In our case, that time roughly translates to how long a normal query takes, so the query can finish before the process would be migrated, because we hope the process is idle again by that time. Yeah, and transparent huge pages are not in use; I believe you already figured that from what I said before. Going a bit into detail: we have 100 gigabytes of RAM, which is very big, as I see it, for a database of 500 gigabytes. But we need to be very cache-centric; we need those response times so people can work. shared_buffers is at about 10%, but I believe you all know how Postgres configuration works: the shared buffers are in many cases backed by the file system cache, so it is sufficient to have them at around 10%, and that works very well. Maybe you can read the metric on the slide: we draw the cache hits in red, and there is a small blue portion, now and then, where a block is not in the Postgres cache and has to be fetched from the file system cache. But in general it works, and since we have 100 gigabytes of RAM, whatever gets pulled from the file system cache is most likely already cached somewhere and doesn't have to be pulled from storage. Talking about storage, we write around 30 gigabytes of WAL every day. Maybe you can see the peak at night: there are some CLUSTER runs and VACUUM FULLs, just to keep things a little smoother. Things fragment, even in a database, and sometimes you have this big sequential scan that you just can't get rid of.
Sometimes it is a good idea to cluster, and that is what we actually do. And if you know how streaming replication works: if you CLUSTER or VACUUM FULL a whole table, all of it is replicated to the slave. So for us, the slave sometimes lags a little behind, because it gets fed five gigabytes in a few minutes and needs to apply them, and that just doesn't always keep up. But it's not that big of a problem. Talking about load, we have on average around 800 TPS. That doesn't sound like much; if you run pgbench on a regular basis, 500 TPS is something you can achieve on your own hardware. But these are not pgbench TPS: this is not a single-key lookup, these are stored procedures, business logic, the whole process of someone buying an item. Of course, around the Christmas holidays people buy more, so it is fair to assume we get a lot more traffic then. One thing to keep in mind: these values are aggregated, over hours or maybe days. So there are huge momentary peaks that the system has to keep up with, but on average, 800 to 1,500 TPS is the result. System load: I mentioned we have about 50 cores, but you have to understand that you need far more cores than your average load suggests, because we need to respond quickly and make everything run smoothly. So in single seconds we go to loads like 10 or 20, maybe even higher, depending on what exactly is being asked at that moment, but we can still always answer in a few milliseconds. What is good in terms of query response time? It depends, it highly depends. To my understanding, good is anything like 100 to 300 milliseconds, depending on what exactly is being done. And our goal is always to make things as smooth as possible: there should be no locking, no data being requested from storage, things have to go fast. And if that is the baseline for the query runtime, I believe this is a very good result.
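As an aside on the TPS figures above: aggregated TPS numbers like these are typically derived from the transaction counters in `pg_stat_database`, by sampling `xact_commit` twice and dividing the delta by the interval. A minimal sketch with made-up sample values (in production the counters would come from a monitoring query, not constants):

```python
def tps(commits_t0, commits_t1, interval_s):
    """Transactions per second between two samples of pg_stat_database.xact_commit."""
    return (commits_t1 - commits_t0) / interval_s

# Hypothetical counter samples taken 60 seconds apart.
sample_t0 = 1_437_204_000
sample_t1 = 1_437_252_000   # 48,000 commits later

print(tps(sample_t0, sample_t1, 60.0))  # -> 800.0
```

Averaged over an hour or a day, short peaks disappear into a figure like this, which is why the momentary load can be much higher than the headline TPS.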
And then again, we have this fuzzy requirement that things have to go fast. People work with this system; they are no machines. This is not the stock market, where something has to respond within 10 milliseconds. These are people trying to do their work, and they don't work faster than about 100 milliseconds either; they need a certain time to react as well. Yeah, sure, sure. There were always moments where the system had some hiccups; we could never really pin them down to the hardware. It was always something regarding the software, regarding a query, maintenance jobs, of course. But that's basically it. We have seen smaller hiccups on the host side already, but they have never been noticed by the clients, actually. There was always Pacemaker complaining about things getting a little tight, but the front end was never really affected. All right, so going into backups. I said we have this streaming standby, and we take backups from the standby. Now, if you're familiar with Postgres, you know that taking backups from a standby is a hard thing, because the standby cannot provide synchronized snapshots, so you have to do the whole dump in a single transaction over one connection, which you don't want. That takes ages: if you have 500 gigabytes and have to pull them all through one thread, that may take a very long time. So what we do is go to the standby, where we could pause replay if we wanted a fully consistent state, but right now we don't have to, because we figured it is far more efficient to pull exactly the tables we need and dump them to the storage. So actually, this is not a full database backup; it is really just dumping the important information. I believe that is very often enough, because we could recover from it: if we take a hit and everything is gone, we could always restore the important queues, and things would be up and running within a few minutes.
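The selective dump just described boils down to running `pg_dump` against the standby with a `--table` option per critical table. A sketch of how such an invocation could be assembled (the host, database, and table names are hypothetical, not the customer's):

```python
def dump_command(host, dbname, tables, outfile):
    """Build a pg_dump argument vector that dumps only the given tables."""
    cmd = ["pg_dump", "--host", host, "--format=custom", "--file", outfile]
    for t in tables:
        cmd += ["--table", t]
    cmd.append(dbname)          # database name goes last
    return cmd

# Hypothetical standby host and queue tables; in practice this list would be
# run via subprocess or from a cron-driven shell script.
cmd = dump_command("standby1", "logistics",
                   ["orders", "picking_queue"], "/backup/logistics.dump")
print(" ".join(cmd))
```

Because each table is dumped in its own statement, the critical data can be restored selectively and quickly, at the price of not being a single consistent snapshot of the whole database.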
Apart from this, we take snapshots of everything else now and then, but as that data doesn't change much and the newest state is not always required, it is sufficient to take them on a daily or even monthly basis. Yes, one thing that is about to come is point-in-time recovery, which we are not using right now. The idea is that the master will copy its data set to the storage, which is then mounted on the slave; the slave can recover from it, effectively testing the backup, and keep it or discard it and request a new one, depending on what we want. That may come soon. First of all, we have to get the Oracle RAC gone, and then we'll do more on the backups, and I believe then we'll have about everything we ever wanted. Oh, absolutely, absolutely. The customer is very happy with it; it's working fine. It's a great cooperation, to be fair, because they know what they're doing, and so do we, and it just works very well. As I said, there are hiccups sometimes, as in every system, but they have never affected the front end, at least not to any serious degree. Right, so I think that's about it. Yeah, I believe it is steadily growing. They are currently expanding to a certain degree, but I wouldn't know exactly, because most of the load really gets taken by the Oracle RAC at the moment, and that is not doing very difficult things; it's key-value lookups, things people are browsing. And that scales not even with how many people buy things, but with how people use the internet, how people look at websites more now than they did a few years ago. But this doesn't influence the database at all at the moment. It will, maybe, but that information will come once we finally get rid of the RAC. Right, so, any more questions? Otherwise I think we are done so far.