So, good afternoon everybody. My name is Alexander Kukushkin — don't try to pronounce it, it's mighty hard. Today I'm going to talk about our experience at Zalando of running Postgres on Kubernetes at a big scale. Just a few words about me: in the Postgres community people know me as "the Patroni guy", because I'm the author of and the biggest contributor to the Patroni project, which implements Postgres high availability and automatic failover. It has recently been seven years since I joined Zalando. I have a Twitter account — you can follow me, you can post a picture about this talk — and the slides are already published on the conference website.

So, how many of you know about Zalando? For those who don't: Zalando is a European fashion retailer. We sell clothes and shoes online and we operate in 17 European countries, so we are quite big. In terms of the number of employees, at the moment we have about 15,000 of them, and at the same time we have a huge technology department, roughly 2,000 people, and we run a lot of Postgres. Today I'm going to talk about how we run Postgres on Kubernetes, why we are doing it, and how Spilo and Patroni help us run Postgres on Kubernetes. I'll give brief details about how the Postgres operator makes our life much easier than it would be without such automation. And last but not least, I'm not going to showcase only the happy path — there are, apparently, some problems you also need to be aware of if you are going on such a trip.

So, Postgres on Kubernetes. First, about Kubernetes at Zalando: at the moment we have more than 140 Kubernetes clusters, split roughly fifty-fifty between production and test accounts. Basically, not every team, but every cost unit, gets its own Kubernetes account: a Kubernetes cluster for production, and also a Kubernetes cluster for the test environment. We really need to isolate production and test, and we encourage people not to run any test systems on production and the other way around, because such clusters are sometimes treated differently — for example, test clusters may run on EC2 instances which are not persistent but are cheaper. Deployment to production can be done only through the CI/CD system; nobody has direct access to production clusters by default. If you want access to production you have to request it explicitly, or you have to create an incident ticket, because if you are on call you sometimes may not get anybody to help you — but you cannot just exec into a pod and do something. You can look at logs without requesting access, though: basically you have some access, but it's read-only.

Now, coming closer to Postgres: since we have quite a few Kubernetes clusters, we also run a lot of Postgres on Kubernetes, and at the moment there are more than 1,400 highly available Postgres clusters on Kubernetes. Actually this number might be a bit higher, because it always changes — people deploy stuff, people delete stuff, and sometimes we are not even aware that it's happening.

Since you are all very familiar with Kubernetes, I don't think I should spend a lot of time on this slide, but we can compare Kubernetes with traditional infrastructure: what on premise is a physical server, on Kubernetes you call a node.
On a physical machine you can run a virtual machine; on Kubernetes it's a pod with a container inside, and so on and so forth. What's important is that there are two different types of nodes. There are master nodes, which run the Kubernetes API server — there might also be an etcd server on the master node, or it might be somewhere else — and there are worker nodes, where our applications run, including our Postgres: Postgres runs in a container, and the container lives inside a pod.

Postgres is obviously a stateful application, and from the beginning Docker wasn't really meant to be used for stateful workloads; the same applied to Kubernetes itself. But since people want to run stateful workloads on Kubernetes, some new features were introduced. At first it was called PetSet, but when it became production-ready it was renamed to StatefulSet. A StatefulSet is a Kubernetes entity which guarantees a fixed number of pods: you say "I want to run three pods with Postgres" and it guarantees that there will always be three pods in this deployment. Such pods always have predictable and unique names, and, depending on how you configure it, during deployment it will either create all of them at the same time or do it sequentially, one after another (a minimal sketch of such a StatefulSet appears at the end of this passage).

The second very important part is a persistent volume where we keep our data. Without persistence it's tough: if all pods go down at the same time and you don't have persistent storage, your only option is to restore from a backup. With persistent storage, which is actually abstracted quite well by Kubernetes, you get your data back when your pod is resurrected. Kubernetes supports different storage types via plugins: if you're running on Amazon you can use EBS, if you're running on Azure you can use Azure Disk, there are solutions for on-premise like iSCSI or NFS, and there are even open-source storage solutions like GlusterFS or Ceph.

Coming closer: in order to run any kind of application on Kubernetes we need an image — a Docker image for the container. Kubernetes supports not only Docker containers, there are different kinds of containers, but I'm pretty sure Docker is the most popular, and we also use Docker. In our Docker image we package all Postgres versions starting from 9.3 up to the latest one, which is now 12. Why? Because upon start we want to specify which version to run. And why do we keep these old versions? Because sometimes we have to migrate databases to the cloud which are still on an old version; after migration we usually upgrade them to the latest major version. One more reason to keep all versions together: for a major upgrade we have to have the binaries of both versions, the old one we are migrating from and the new one we are migrating to. Also in our Docker image we package Patroni, and we package the backup tools WAL-E and WAL-G. PgQ and PgBouncer are available as well. It might be an anti-pattern, because we may start the PgQ daemon and PgBouncer in the same container; there is no good reason to do it like that on Kubernetes right now, but we use the same container on plain EC2 instances too, and that infrastructure doesn't really allow starting more than one container at a time.
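To make the StatefulSet part from a moment ago a bit more concrete, here is a minimal sketch of a StatefulSet with a per-pod volume claim. The names, image reference, mount path and sizes are illustrative placeholders, not our actual production spec.

```yaml
# Illustrative only: a pared-down StatefulSet with one persistent volume per pod.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: demo
  replicas: 3                       # three pods: demo-0, demo-1, demo-2
  selector:
    matchLabels:
      application: spilo
      cluster-name: demo
  template:
    metadata:
      labels:
        application: spilo
        cluster-name: demo
    spec:
      containers:
      - name: postgres
        image: registry.example.com/spilo:latest   # placeholder for the Spilo image (Postgres + Patroni)
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: pgdata
          mountPath: /home/postgres/pgdata          # illustrative data directory location
  volumeClaimTemplates:             # one EBS (or other) volume per pod, surviving pod restarts
  - metadata:
      name: pgdata
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```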
For high availability we are using Patroni, and the configuration of the whole setup is done, as usual, via environment variables. So what is Patroni? Patroni is an automatic failover solution for Postgres. It's written in Python and it integrates natively with Kubernetes: you don't need any third-party dependencies like etcd, Consul or ZooKeeper — you can just deploy it on Kubernetes and get your highly available Postgres. Basically, Patroni makes Postgres a first-class citizen on Kubernetes. But it's not only about high availability: Patroni helps automate a lot of other things, like the deployment of a new cluster, when we have to run initdb and get a new data directory. Patroni really helps when you scale in or scale out, when you increase or decrease the number of pods. It also helps manage the Postgres configuration: you update it once, and the update is applied to all pods in your highly available cluster — Patroni takes care of it.

So how does this look? We have two nodes, and on every node we get a pod — demo-0 and demo-1 in this specific example. These pods belong to a StatefulSet, and the name of the StatefulSet is demo. The naming is always predictable and unique: the very first pod in the StatefulSet gets the suffix 0, the next one gets the suffix 1, and so on. Patroni uses either ConfigMaps or Endpoints to do leader election; in this specific case there is an Endpoints object which Patroni uses for leader election, and a Service which provides the single endpoint for the applications running in the Kubernetes cluster. Applications don't really need or want to know which pod they have to connect to; they just connect to the Service — a standard Kubernetes thing. We keep the data, as I explained, on an external volume, and we archive all WAL segments and do a base backup once a day to S3, because it's available on AWS — but S3 is not really a requirement, WAL-E and WAL-G work with different kinds of cloud storage. (A minimal sketch of the Patroni configuration for this setup follows below.)

How do you deploy all this? The very Kubernetes way: you write a lot of YAML. It's not an easy task, because such YAML files easily grow to a thousand lines, sometimes even longer. You can deploy a new cluster this way, but after that, if you want more resources — CPU or memory — you have to update the StatefulSet and somehow do a rolling upgrade of the pods. And it doesn't help us create objects inside our Postgres clusters, like users or databases.
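As a concrete illustration of the Patroni side of this setup, here is a minimal sketch of a Patroni configuration pointed at the Kubernetes API instead of etcd. In Spilo the real configuration is generated from environment variables, so treat the names, labels, addresses and paths below as illustrative rather than as our exact setup.

```yaml
# Minimal sketch of a Patroni config using Kubernetes for leader election.
scope: demo                     # cluster name, used in the names of the Endpoints/ConfigMaps
name: demo-0                    # member name, normally the pod name
kubernetes:
  namespace: default
  use_endpoints: true           # keep the leader record in an Endpoints object
  labels:
    application: spilo
    cluster-name: demo
restapi:
  listen: 0.0.0.0:8008
  connect_address: demo-0.demo:8008
postgresql:
  listen: 0.0.0.0:5432
  connect_address: demo-0.demo:5432
  data_dir: /home/postgres/pgdata/pgroot/data   # illustrative path
```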
And the biggest problem for us turned out to be the Kubernetes rolling upgrade. Kubernetes is evolving really fast — we get a major release roughly every quarter — and when you apply a new major version you have to rotate all worker nodes, the nodes where we run Postgres. Here is what may happen if you are "lucky". Imagine we have a cluster spread across different availability zones — in AWS an availability zone is a separate data centre — and to get truly high availability we want to distribute our Postgres clusters across all zones. During a rolling upgrade Kubernetes first decides to decommission one node. So what happens? The pods living on that node of course die, Patroni does a failover, and now we get primaries in different pods on different nodes. That's already the first failover for a few clusters. Then we get a new node, and the pods which were deleted are resurrected on this new node, running as replicas. Next, another node is decommissioned and we get a bunch of further failovers — that's not nice either. We still have a third node which needs to be upgraded, and with the third node it's the same story: again a few failovers. It turns into a nightmare. If you didn't count, I counted for you: for cluster A we had three failovers, because we had three pods. That is the very unhappy case, but it is quite likely to happen; on average you will get between one and three, so about two failovers, if you run Postgres clusters with three pods. So we really thought about improving this — we had to orchestrate things much better.

If you analyze the life cycle of a Postgres cluster: first you deploy it. That's the very easy part — you can even take a Helm chart and deploy it with one command — but it's just the first step, and the cluster lives much longer. Very often people want to increase or decrease resources, create users, sync passwords, create databases, and so on and so forth; this is a cycle, and eventually the cluster may be decommissioned. So our goal was to automate everything: deployments; cluster upgrades, including upgrades of Kubernetes itself, upgrades of the Postgres Docker image, and minor Postgres upgrades; user management; and of course we want to minimize the number of failovers during every upgrade. So we implemented the Zalando Postgres Operator.

Are you familiar with the operator concept in Kubernetes? For those who are not: an operator acts like a human; it basically encapsulates all the human knowledge about operating a certain resource. In our case the Postgres operator acts like a DBA who manages Postgres clusters on premise. If a node dies, it takes care of the replacement. If you want to change some Postgres configuration, the operator can do it. It can extend volumes, even without any interruption of service. It helps you manage the Kubernetes resources of your container, like memory and CPU. It can create databases and create users for applications. And — important and very convenient — for a user in the database it creates the corresponding Secret object, so the application just mounts the Secret and connects to the database; you don't keep any passwords in the source code, which is amazing. And of course we implemented smart rolling upgrades.

So how does this look? First of all, in order to deploy a new cluster we create a very simple Postgres manifest — a short YAML document. There are a few key pieces: you provide the size of the volume (how much storage you want for your Postgres), the Postgres version, the number of instances, and a name for the specific cluster you want to create.
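A minimal sketch of such a manifest might look like the following; the field names follow the open-source Zalando postgres-operator as I recall them, so double-check the details against the operator documentation.

```yaml
# Hedged sketch of a postgres-operator cluster manifest; field names may differ in detail.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-demo-cluster
spec:
  teamId: "acid"
  numberOfInstances: 3          # pods in the StatefulSet
  postgresql:
    version: "12"
  volume:
    size: 10Gi                  # persistent volume per pod
  users:
    demo_owner:                 # operator creates the role and a Secret with its password
    - createdb
  databases:
    demo: demo_owner            # database name -> owner role
```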
The operator does everything else for you: it watches for such new YAML manifests appearing via the Kubernetes CRD, starts creating the StatefulSet, creates the Service, the Endpoints, and the Secrets for the Postgres superuser and the replication user, and it creates roles and databases once everything is up and running.

And how does the operator help us with rolling upgrades? The operator can subscribe to events in Kubernetes about nodes to be decommissioned, and if it notices that a specific node will be decommissioned, it takes action. First, it knows that there are, say, two primaries running on the node that needs to be decommissioned, so it moves replicas away from the nodes to be decommissioned onto new nodes. That's a very simple task: it simply terminates those pods, and they will not be scheduled onto nodes that are about to be decommissioned. So at this point all replicas have already been migrated to the new nodes. Then the operator does a switchover: the primaries running on the nodes to be decommissioned are switched over to the replicas on the nodes that are already new. Basically, we got to the situation where the whole rolling upgrade causes only one switchover per cluster. In this example we are dealing with only three clusters, and even so it's already not easy for a human to follow what is going on; just think about running hundreds of Postgres clusters on Kubernetes — would any human be able to handle such a rolling upgrade? I think not.

Now the interesting part: what problems did we hit? The very first thing we had to deal with is the AWS infrastructure itself. Like any cloud API, AWS doesn't let you hit its API services constantly; if you hit them too often they start throttling you and you get a "rate limit exceeded" exception. Such exceptions are not great, because they prevent attaching and detaching EBS volumes to the new nodes — and those volumes contain our data, and without the data we cannot start our replicas. It apparently causes delays, and running a highly available cluster with just one pod is not great: you don't lose data, you don't lose transactions, but if that very last pod goes down, you don't really have your cluster anymore.
Sometimes you see EC2 instances failing — it's hardware, after all. Usually AWS notifies you in advance: this instance is running on problematic hardware and you need to take care of its retirement — just restart the instance and it will be rescheduled onto other hardware. But in some cases the instance just sort of dies: it becomes unavailable, yet from the API perspective, if you look into the CloudWatch console, you see that the instance is "shutting down", and such a shutdown takes ages, up to 30 minutes. During these 30 minutes you cannot detach the volumes used by that instance, you cannot reuse them, and again that's not great for availability — although a failover happens, Patroni works, and you get your primary back within about 30 seconds.

The next one is not really a problem of Kubernetes per se, but it happens sometimes that Postgres runs out of disk space, and since it cannot write, it stops itself. Patroni of course tries to start it up again, and we get into a loop: shutdown, start, promote, and again Postgres tries to process transactions, tries to write something to the WAL, cannot, and you end up in a crash loop. So, basically, disk space must be monitored. There are a bunch of reasons why we didn't implement volume auto-extend, although it's very easy to do in the cloud nowadays. If you analyze the cases where disk space really went down to zero, very few of them are natural data growth, which is how it is supposed to happen. Very often we see that, for example, a replica was broken, we didn't pay attention and didn't help the operator repair it, and therefore the WAL directory filled up because of a replication slot. Sometimes applications behave very strangely and generate a lot of Postgres logs because of constant connects and disconnects; sometimes humans do silly things. So we decided, up to now, not to implement auto-extend and just to monitor and react: sometimes it's much easier to simply clean up, or reduce the log level, to mitigate the problem. If we implemented auto-extend, our bill would basically blow up.

Here is an interesting case: ORMs cause a lot of problems for DBAs, and this one is actually quite unusual. It's not about generating a bad query that produces a weird query plan; it's about the ORM generating so many unique queries that it manages to fill pg_stat_statements up to two gigabytes — and WAL-E doesn't like that. If a file takes more than 1.5 gigabytes on disk, WAL-E complains that something is wrong: this is not a Postgres cluster, you are not supposed to have files larger than 1.5 gigabytes. And if you look into pg_stat_statements, you see that it's essentially the same query repeated with a huge number of different combinations of parameters.

Another issue — again not really a Kubernetes issue: because we don't yet trust WAL-G that much, we are still on WAL-E, and WAL-E does exclusive backups. If you terminate Postgres while it is in exclusive backup mode, a backup_label file is left in the data directory, and it prevents such a node from starting up as a replica — it looks as if it is being restored from a backup, although that's not true. As of now I managed to solve this by removing the backup_label from the data directory, under certain conditions, when the image starts up; but it is solved not in Patroni, it is solved in Spilo.
And now a very interesting case: the out-of-memory (OOM) killer — I assume you are familiar with it. In a container you usually specify that the container is allowed to use up to a certain amount of memory, and on Kubernetes this is very important, because on the same node you don't want to run more containers than can actually fit. In this specific case Postgres was killed with signal 9 — this is what you see in the Postgres log file — but it's not really clear why: no human was connected to this pod, and we know for sure that there is nothing there that would send signal 9 to kill Postgres. Luckily we can exec into the pod and run dmesg, and then we see: all right, the OOM killer was involved, but the process ID is different, because inside the container you get different process IDs than on the host, which makes things harder to investigate. When you run on premise, or run Postgres on a dedicated machine, you usually adjust the OOM score to avoid Postgres being killed; in a container it doesn't make much sense, because only Postgres and Patroni are running there, the kernel will decide to kill one of them anyway, and neither of those is good. We were really puzzled why it was happening: in this specific case it was a container with an 8 gigabyte memory limit, and somehow the cgroup reported 6 gigabytes of memory usage — not on one process, but if you sum up all Postgres processes they managed to get to 6 gigabytes, which should not really be possible. I had no idea how that happened; well, I have some suspects, and we'll come back to them later.

Another out-of-memory case is actually funnier. We run kubectl get pods and see that the container in the pod was restarted seven times. All right, let's see what's going on: we describe the pod and, at the very end, see some events registered on this pod telling us the container was restarted because the "sandbox changed". What is that? Google helps, of course, and after googling a little we again run dmesg and see a very interesting picture: all processes inside our container have the same oom_score_adj of -998. It turns out such a score is assigned when the memory request and the memory limit are exactly the same — that's the "Guaranteed" quality-of-service class, something Kubernetes does. And since all processes belonging to our pod have the same score, the kernel decides to kill something, and it actually kills neither Postgres nor Patroni: it kills the pause process, which takes very little memory, and killing the pause process is what causes the "sandbox changed" event. So this is another OOM case which is not at all obvious.
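To make the QoS point concrete: Kubernetes assigns the "Guaranteed" class — and with it oom_score_adj -998 for the pod's processes — exactly when every container's requests equal its limits, as in this small sketch with arbitrary values.

```yaml
# When requests == limits for every container, the pod gets the "Guaranteed"
# QoS class and its processes end up with oom_score_adj -998.
apiVersion: v1
kind: Pod
metadata:
  name: demo-0
spec:
  containers:
  - name: postgres
    image: registry.example.com/spilo:latest   # placeholder image
    resources:
      requests:
        cpu: "1"
        memory: 8Gi
      limits:
        cpu: "1"
        memory: 8Gi
```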
How can we mitigate these OOM kills? The first step was obvious: reduce shared_buffers and with it the memory usage in our container. The second one was not so obvious, and it wasn't easy to catch what was going on. We are still running on the same Linux machine, and the same rules about virtual memory apply: if a process writes something to disk, you get a lot of dirty pages, and if you run in a container, the cgroup accounts those dirty pages to your process. Your process may be using only, say, three megabytes, but it managed to write three gigabytes of data that is not yet flushed to disk — so from the cgroup's perspective your process takes those three gigabytes of memory, and that is probably one of the reasons it gets killed by the OOM killer. Unfortunately we cannot apply the usual trick with the virtual-memory dirty settings per pod or per container; we can only set such values and limit the amount of dirty buffers in memory per node. We haven't rolled it out to production yet, but applying it manually on some of the most problematic clusters, where we knew out-of-memory events were happening, really helped to improve availability and service continuity.

Docker itself, especially with Postgres starting from version 11, can also get you into problems: when a query produces a parallel hash join, Postgres wants to use some shared memory, and of course it tries to allocate it in /dev/shm, which in Docker is 64 megabytes by default. In this case it apparently wanted to write something and got "no space left on device". It doesn't crash Postgres, but your query gets cancelled — you get an error. So either we disable parallel query, which is not very good, or we somehow increase the size of /dev/shm. With Docker itself it's very easy — you just pass one extra argument to docker run — but with Kubernetes it wasn't so obvious: the only option is to explicitly mount a volume on /dev/shm, and we implemented that in the operator. There is an option, enableShmVolume: true, and you get such a volume mounted; basically up to half of the memory on the node can be used, but you still cannot exceed the memory limit.

Sometimes people want to run logical replication, or do logical decoding for change data capture, and it is not possible to retain the information about the slot position across a failover. Patroni tries to do its best and doesn't open new connections to the newly promoted primary until it has created the logical slot there, but it still does not guarantee that no events are lost.

Another problem, fortunately solved now, is the "FATAL: too many connections" error: Postgres didn't make any difference between replication connections and normal connections from applications. In the old days, probably in Postgres 9.0, it was possible to make the replication user a superuser so that it fell under the reserved superuser connections; now it isn't, but luckily in Postgres 12 we finally get dedicated connection slots for replication connections — and I actually put a bit of my own time into implementing this feature. In my opinion Postgres also lacks a built-in connection pooler. Our next goal in the operator is to deploy PgBouncer on demand: if you want to run PgBouncer in front of your Postgres cluster, the operator will do it for you — very soon, but it's not there yet.

Last but not least: human errors. Sometimes people do very silly things. "We want to save as many resources as possible," and therefore they specify very small CPU requests and limits and very small memory requests and limits — and then what can even start? Patroni itself takes something like 35 megabytes, Postgres also needs some resources, and if you don't provide them, either things cannot start or the OOM killer will come soon and terminate someone. The second mistake: people assume we are dealing with big data — say, 10 gigabytes — and want to run the pod with 32 gigabytes of memory; but the issue is that the node pool only has nodes of a certain size, and if your request or limit exceeds the resources available on a node, such a pod simply cannot be scheduled. That's not nice, but it's a human error.
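Both of the manifest knobs touched on above — the /dev/shm volume and the resource requests and limits — live in the cluster manifest; here is a hedged fragment, with field names again as I recall them from the open-source operator.

```yaml
# Fragment of a postgres-operator cluster manifest; field names may differ in detail.
spec:
  enableShmVolume: true        # mount a memory-backed volume on /dev/shm for parallel queries
  resources:
    requests:
      cpu: 100m
      memory: 1Gi              # too small and Patroni (~35 MB) plus Postgres cannot even start;
    limits:                    # larger than any node in the pool and the pod cannot be scheduled
      cpu: "1"
      memory: 2Gi
```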
A couple of times we observed that people removed the service account used by Spilo and Patroni. The service account is the thing that allows you to connect to the Kubernetes API, and as soon as it is removed you lose the connection to the Kubernetes API; Patroni can no longer update the leader lock and restarts Postgres in read-only mode. Why would anyone do this? "What is this object? Let's remove it and see what happens" — and apparently people did it twice.

And the very last one: YAML. It is supposed to be human-readable and writable, but in Kubernetes it is neither of those, especially when a manifest becomes a thousand lines long. In our case the manifest is not that big, but people still make silly mistakes, like bad formatting, so we came up with a web UI where you can fill in a few fields and get your manifest ready. In a test cluster there is another button, "please create it"; for production you just copy this manifest, paste it into a file and commit it to a git repository. This web UI is also open source.

So, we are nearly done. The Postgres operator helps us manage more than 1,500 Postgres clusters, distributed across nearly 100 Kubernetes accounts, with minimal effort. We are sleeping at night; we are not waking up because yet another node died. Without the operator it wouldn't be possible, but right now a single engineer is able to take care of the whole fleet — a whole herd of Postgres clusters. In the cloud, and especially on Kubernetes, you really have to be prepared to deal with absolutely new problems which you have never experienced before. Digging down through the whole infrastructure layer is sometimes not easy: you start reading kernel mailing lists and the Linux source code; sometimes it helps, sometimes it doesn't, and sometimes you have to guess — as with virtual memory and dirty pages, because such a spike of dirty pages usually happens very quickly, within a few seconds, and regular monitoring that takes a snapshot every minute is not able to catch it. And once you have found the problem, you really have to think about applying a permanent fix or a workaround — in the operator, in Patroni, in Spilo, anywhere — in order to avoid the problem in the future.

Everything I was talking about is open source: the Postgres operator is an open-source project; Patroni is open source and just yesterday surpassed 3,000 GitHub stars; and Spilo, the Docker image which includes Patroni and Postgres, is open source too. That was it from my side. Any questions?
Can you speak louder, please? Okay, so the question was about the web UI I showed here — yes, this web UI is part of the Postgres operator; it lives in the postgres-operator repository.

The next question is what happens if the Kubernetes control plane dies. In our case we run two API nodes for every cluster, and if one node dies the second one continues to work; since Patroni uses the Kubernetes Service to reach the Kubernetes API, it basically retries the same read or write against the other API node. In the unhappy case, yes: if both API nodes are unavailable, Postgres will go read-only.

The question is what size of databases we are running. The most common case for us on Kubernetes is small clusters, less than one gigabyte of data, but there are a few clusters of a few terabytes, up to four or five. Basically there are no limitations in Patroni or Kubernetes on the size of a cluster — the amount of data can be very big, you just have to understand that — and if nobody is reading the data and you are, for example, only writing, it's very easy to keep it.

The question is whether there is a recommendation about a max_connections limit. Kind of — there are good practices not to set max_connections very high. In our setup, when a new cluster is deployed, max_connections is calculated from the memory allocated to the pod; the bare minimum is 100, the maximum we set is 1,000, and in a few cases people actually asked for more. But the more connections are open, the more unpredictable problems you will have — that's basically Postgres internals.

The question is what the difference is between a failover and a switchover. For me, a failover is an unhappy event, when the node goes down unpredictably, while a switchover is something triggered manually, either by a human or by the Postgres operator. During a switchover you shut down Postgres gracefully, doing a smart shutdown: it makes sure that everything is written to disk and the very last bit is replicated to the available replicas, so during a switchover it is very unlikely that you lose anything. During a failover, if you are not using synchronous replication, there is a chance that some transactions are lost.

Also, how does the switchover for a rolling update work? Patroni has a REST API, and the Postgres operator simply calls it: please do a switchover from this pod to that pod. Patroni does everything else on its own, without involving the operator or any third party. Basically Patroni notices: okay, there is a switchover request, which means we want to shut down the current primary. To minimize downtime, Patroni first does an explicit checkpoint, and only after that calls pg_ctl to do a smart shutdown. When the shutdown has completed, Patroni releases the leader lock; the replica which is supposed to become the new primary notices that there is no leader lock and that it is the favourite candidate, grabs the leader lock, and promotes. It usually takes a couple of seconds from the moment the shutdown starts until the new primary is promoted and becomes writable.
So the question is where we were running Postgres before we started with Kubernetes — which was actually a bit more than two years ago — and what the difference is, whether it's cheaper or more expensive. Basically, it's a long story. At Zalando we started with Postgres on premise; the next step was AWS, with EC2 instances in auto-scaling groups, and we still have more than 100 clusters running on EC2. But Kubernetes has its own benefits. It may not always be cheaper, because of the overhead coming from the Kubernetes master nodes, but the operator really helps us manage so many Postgres clusters — without Kubernetes and the Postgres operator, managing one and a half thousand clusters would not really be possible, I think. And when application teams want to deploy a new cluster, they just create such a manifest; they don't even come to us. Our responsibility is just to make sure that the whole infrastructure is up and running and that you always have a primary in every cluster.