Hello everyone, my name is Kunal Jain and this is Yuen; we work for Cvent. Thank you for joining us, and bear with us while we make sure everything's still working. We're going to be talking about how Cvent uses some of the DevOps tools out there for managing Postgres, and how we got into Postgres in the first place.

Just a quick mandatory introduction slide. Cvent is one of the biggest software providers for meetings and event management (interestingly, we're not hosting Postgres Conference). It was founded back in 1999, and over the last 17 years it has grown to 2,400 employees around the world, with 166,000 users, a whole bunch of events, about 65 million registrations processed across all those events, and about six billion dollars in payments processed so far and counting.

So how did we get introduced to the Postgres world? This was Cvent's legacy architecture, a classic client-server model, prior to 2015. We were primarily on the Microsoft stack, with SQL Server. A user would log in and pass through F5 load balancing into our web application tier, written mostly in C# and ASP.NET with some MVC in there, and that layer talked to SQL Server on the backend. As we started growing, we realized that a single SQL Server instance wasn't helping us, so we started splitting it out to try to scale a little: a dedicated OLTP server, a dedicated OLAP server, and then read-only servers so we could redirect some of the read-only traffic there. That didn't help enough either.

So around 2015 our technology team made the decision to move away from this architecture to a service-oriented architecture, really more of a microservices architecture, where we wanted to split our modules into smaller, more manageable microservices that could be scaled as needed. These are some of the primary reasons we started looking down that road. Loosely coupled code allowed us to grow infrastructure and architecture separately; modules were split into different services that were less dependent on each other. It also allowed us to move away from using Microsoft products exclusively, so we started writing services in Java, and some of our services also use Node. Then there was the scalability component: we could scale our web tier pretty quickly, but we couldn't get the database to scale, and there were also issues scaling web tiers across different data centers. Then there was the reusability component, where we could reuse a certain piece of code over and over again, as opposed to keeping it in the database in separate stored procedures. Most of our application logic lived in the databases in the client-server architecture, so we wanted to move it out into its own layer that we could scale separately.

This is how our current architecture looks. We still have the old system in place because, given the size of our databases, we couldn't just migrate everything over. Our SQL Server OLTP instance alone is about 14 terabytes and growing, so it wasn't an easy switch; plus it's the instance that drives the company, so you can't just go to management and say you're going to move everything over and hope it works. So we came up with this architecture where we started doing shared services.
What that meant is we took certain modules, broke them up into multiple services, and had the services use different data stores. For the RDBMS we moved away from SQL Server to Postgres; that's how we got introduced to Postgres. For all our search services we used Elasticsearch, and our document store and cache layer is Couchbase, so we moved our cache layer into Couchbase at that point too. This architecture allowed us to selectively move modules and split them into services, as opposed to doing a complete switch in one shot. It also allowed us to move data gradually, as opposed to trying to move the entire 14 terabytes from the old OLTP instance to multiple Postgres instances in one go. One of the bigger driving factors was that we could write our own migration tooling for moving data from SQL Server to Postgres; for example, we wrote trigger-based code so that, as data was inserted in SQL Server, it was pushed out to RabbitMQ (which doesn't show up on this slide), queued up there, and consumed by services. Since our modules were broken up into multiple services, the same piece of data was being consumed by multiple services. As part of this architecture, each service got its own Postgres database, which grew us from a couple of Postgres databases to about 45 production clusters and a little over 60 dev clusters.

When we realized it was growing faster than we could build them, we decided to look into DevOps tools. This is what a typical database life cycle looks like at Cvent. We start with provisioning using CPS, our in-house-built provisioning system. Then we install and configure Postgres entirely with Chef; there is no manual intervention there, and all our deployment is handled by Jenkins and Liquibase. A Java service accesses the Postgres instance using the JDBC driver for all its queries; there are no stored procedures or functions involved. We use ORMs throughout, which is our preferred way of going, with MyBatis as the ORM for services (there's a small sketch of this below).

These are some of the other monitoring and support tools we have. Backup management is done using Barman, and that is completely automated via Chef too; the entire setup of Barman, including creating the config files and where everything goes, is all Chefed up. Again, no manual intervention. For monitoring we primarily use Datadog and Splunk. Datadog was chosen as the monitoring tool because it's used across the company for more than just databases, so it was the natural choice, and Splunk is where we ship the Postgres logs from our multiple clusters. It gives us a central place for looking at the logs, as opposed to logging into each cluster and seeing what's going on. Then there's pgBadger, which we run periodically for performance analysis on different clusters when we start seeing problems with them. It's an on-demand thing; we don't run pgBadger on a schedule and store the reports, we run it as and when needed. We also have a homegrown tool called DBInventory, which collects server-level and database-level statistics across all 45 production clusters and some of the dev clusters; this tool also works with SQL Server. It lets us look at the overall picture and helps us with capacity planning. It's been tremendously helpful.
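Since all the SQL lives with the service rather than in stored procedures, the mapper interface is the natural unit of database access. Here is a minimal sketch of what a MyBatis mapper in that style could look like; the `Event` class, table, and column names are invented for illustration and are not from Cvent's actual services:

```java
import java.util.List;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

// Minimal POJO; MyBatis maps result columns to these fields by name.
// (Hypothetical entity, for illustration only.)
class Event {
    private long id;
    private String name;
    // getters and setters omitted for brevity
}

// The mapper interface: plain SQL annotated on methods, executed over
// JDBC against Postgres. No stored procedures on the server.
public interface EventMapper {
    @Select("SELECT id, name FROM event WHERE planner_id = #{plannerId}")
    List<Event> findByPlanner(@Param("plannerId") long plannerId);

    @Insert("INSERT INTO event (name, planner_id) VALUES (#{name}, #{plannerId})")
    void insert(@Param("name") String name, @Param("plannerId") long plannerId);
}
```

At runtime the service obtains the mapper from a MyBatis SqlSession configured with the Postgres JDBC driver, so the queries above run as ordinary prepared statements.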
For maintenance, upgrading versions and such, we primarily use pg_upgrade. Recently, for some of our critical services that couldn't afford much downtime but still had a lot of data, we have used pglogical as an upgrade path: we replicated the data over to the new cluster, because pglogical lets you go across major versions, and the cutover downtime for those services went from hours to barely a few minutes. We also have some Jenkins jobs used for maintenance purposes; for example, some of them let us copy or refresh our development clusters with production data as needed, so we can run more detailed testing on them.

So, starting with CPS, or provisioning a virtual machine. In really complex terms, CPS is a custom-built framework that ties together and orchestrates various tool sets and allows us to provision a VM. Simply put, it's a command-line tool for provisioning virtual machines: run a command, and boom, you have a virtual machine provisioned. One of the main reasons for going down the CPS route (it was built in-house by our infrastructure team) was that it was taking us too long to provision a VM. Prior to 2014 it was taking something like 1,500 minutes just to get a VM provisioned, because of the volume of servers being requested. It was also a problem because we were getting inconsistencies depending on who built the VM: the right OS patches were missing, OS settings were missed, the drives weren't laid out the way we wanted them. It was getting increasingly difficult not just to get a VM provisioned, but to get it provisioned correctly. After CPS, all of those problems, or at least most of them, have gone away, and more importantly, our provisioning time is down to 22 minutes. It takes us about 22 minutes to spin up a brand-new VM, whether it's in production or anywhere else. It also works across data centers, since we have multiple of those, and it can configure EC2 instances, which was important because we wanted the capability to do things in AWS.

The CPS workflow overall is pretty simple. It gets an IP from Infoblox, creates the VM, configures the OS, and runs the Chef bootstrap, which essentially installs the Chef client so the machine can communicate with the Chef server; it just runs the basic OS stuff at that point. It adds the VM to the correct Active Directory (we still use AD), adds it to the various monitoring tools we have, and at the end it reports a status to the user: it succeeded, or it failed at a certain step. This is how it looks pictorially. The user currently has three ways to provision a VM: you can put a command in a Slack chat room and twenty minutes later your VM is ready; you can go through a service catalog request in ServiceNow, the IT service management software our infrastructure team uses; or you can do it from your own terminal using the knife CLI, as in the sketch below.
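As a rough illustration of that command-line path: CPS is an in-house knife plugin, so the exact subcommand and flag names below are hypothetical reconstructions from the description above, not documented options.

```sh
# Hypothetical invocation; flag names are illustrative only.
# "--class dev" would map, via the definitions stored in ServiceNow,
# to a concrete spec such as 4 cores, 16 GB RAM, and three drives.
knife cps server create pg-dev-db01 --environment dev --network app-tier --class dev
# ~20 minutes later: IP from Infoblox, VM on VMware, OS laid down via
# Cobbler, Chef bootstrapped, joined to AD, registered in monitoring.
```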
All three of those paths essentially send a request to ServiceNow, which passes it on to a VIP behind which we have multiple servers for load balancing, and the tools-integration host makes the various API calls serially: it gets an IP from Infoblox; we use VMware as our virtual machine host; it then uses Cobbler to install the OS and apply the OS settings. Our primary hardware is HP and Dell, so right now CPS only works with HP and Dell; those are our primary vendors. It then configures Chef, adds the machine to Active Directory, and registers the brand-new VM with the monitoring tools. We use both Zabbix and Datadog; Zabbix is currently being phased out, so mostly everything goes to Datadog.

And this is how it looks: I can go in and just type a knife cps command, and twenty minutes later that server is ready. It's a quick command-line thing that lets you specify whether you want a production or development server, what network you want it in, and what the specifications of the server are. The details of most of that are defined in ServiceNow, where you define how a "dev" class, say, equates to a four-core, 16 GB, three-hard-drive machine, along with the OS version and what the host name should look like. The naming was actually quite important because of our multiple data centers: host names are constructed so that we can tell at a glance which data center a machine resides in, including separate numbering for AWS.

Once the VM is ready, installing and configuring Postgres was the logical next step. After we realized we were having to do this over and over and over again, and needed to do it without mistakes, we decided to use Chef. For those of you who don't know, Chef is an automation tool that lets you configure infrastructure and deploy and manage it repeatably. Some of the other similar tools are Puppet and Ansible; people use Docker, and people have also used Kubernetes to do some magic around this. It's a preference thing, and our company preferred Chef.

A quick introduction: Chef has three separate components. There's the workstation part, which is essentially your laptop or desktop, wherever you're developing what your Chef configuration is going to look like. You create things called cookbooks and recipes, which essentially define: this VM that I'm going to deploy Chef on is going to have Postgres with three disks for data, archive, and WAL logs; it's going to have a certain version of Postgres with a certain patch; it's going to have certain OS patches; it's predefined who can access the VM, and so on. Usually you do all of that in ChefDK, Chef's development kit, and once you're done you publish your recipe out to the Chef server. Then there are the clients, which are the VMs, or nodes as Chef calls them: the new machines you put the Chef client on so they can communicate with the Chef server and pull the latest copy of your configuration. So if we ever decide to roll out a certain OS patch, or install a new tool, say a yum install of mailx or whatever, I don't have to do it across all 45 servers individually: I can put that in a recipe in a cookbook, upload it to Chef, and add that recipe to all 45 servers with a single command. A sketch of what such a recipe might look like follows.
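For a flavor of what that looks like, here is a minimal sketch of such a recipe using standard Chef resources; the package choice, script path, and schedule are invented, and this is not Cvent's actual cookbook:

```ruby
# Roll a utility package out to every node that runs this recipe.
package 'mailx'

# Ship one canonical copy of a maintenance script from the cookbook,
# so every server gets the same version.
cookbook_file '/usr/local/bin/purge_wal_archive.sh' do
  source 'purge_wal_archive.sh'
  mode   '0755'
end

# Schedule it, e.g. nightly purging of old archived WAL files.
cron 'purge-wal-archive' do
  hour    '2'
  minute  '0'
  command '/usr/local/bin/purge_wal_archive.sh'
end
```

Upload the cookbook to the Chef server and add the recipe to each node's run list; every node converges on its next client run.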
In about 25 minutes, all my servers will have that package available. That helped us tremendously, because it made maintenance across all of these servers very easy and kept them consistent: we knew exactly what was going on with each server, what patch level it was at, what software was installed on it, and how it behaved.

So then we started creating a cookbook that would automate our Postgres install. We started with the community cookbook that's available for Chef and soon realized it wasn't going to work for us, because we wanted to do a lot more with it, so we customized it completely and essentially created our own. As of now, the cookbook starts with provisioning logical volumes. When we get a VM provisioned via CPS, there are raw volumes mounted and available, and we create logical volumes on them; for people who are familiar with Logical Volume Manager (LVM), it lets you split a physical volume into multiple logical volumes. Each Postgres server essentially has three drives: a big data drive, a separate drive for WAL logs, and an archived-WAL drive. We then install and initialize the Postgres server. At this point it lets you choose what version you want to install, down to the patch level. We started with 9.4; most of our new servers are now on 9.6, and 9.6.1 is, I think, the latest one we're running.

The cookbook then configures the Postgres variables dynamically. Essentially, it looks at the hardware and follows some recommended best practices to decide what shared_buffers should look like, how much memory to allocate, how WAL compression should be set, and so on. It also deploys our standard scripts: for backups, for setting up replication, for basic maintenance tasks like vacuums. That again was very important, because trying to deploy the same version of a script across 45 servers and counting is much easier this way, and we have a central place to manage them too. It also creates the cron jobs for purging and managing archived WAL logs. As part of the cookbook we also create the databases and add the standard extensions that most of our development teams need. That got us away from the situation where one cluster had an extension and other clusters did not: instead of managing cluster-specific extensions, we say, these are the standard extensions we will always provide the development teams, and if you need more, you come back and ask us. We also create some admin logins as part of the cookbook, and standard databases; one of them is an admin database where we collect various metrics.

The cookbook also adds our Postgres cluster to Datadog by dropping in a Datadog configuration file. For those of you who are familiar with Datadog, it lets you define custom metrics: things like whether blocking is going on, or whether a session has been running for over 15 minutes. The cookbook deploys all of those, so we have the same standard metrics available across all the production servers. A sketch of such a configuration follows below.
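As an example of that kind of custom metric, here is a hedged sketch of a Datadog Postgres check that counts sessions active for more than 15 minutes. The field names follow the Agent's postgres integration (`custom_queries` in recent Agent versions); the exact schema depends on your Agent version, and the connection details are placeholders.

```yaml
init_config:

instances:
  - host: localhost
    port: 5432
    username: datadog
    dbname: postgres
    custom_queries:
      # One custom metric: count of long-running active sessions.
      - metric_prefix: postgres.sessions
        query: >
          SELECT count(*)
          FROM pg_stat_activity
          WHERE state = 'active'
            AND now() - query_start > interval '15 minutes'
        columns:
          - name: long_running
            type: gauge
```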
Then we create the Splunk directory structures, which let us ship the Postgres logs so they're available in Splunk. One of the more important reasons to do this was that we have a support team whose whole job is managing and watching production, and instead of granting each of them individual access to these servers, Splunk acts as a central place. Plus, it has a very nice UI, so people who aren't familiar with the command line, or don't want to learn it, can still search for specific events. It also lets you create alerts: if you see more than 50 errors in this Postgres log, send an alert to these people and let them know something's going on. And at the end of all of this, the cookbook just notifies the user: we're done, the server's ready to use now.

The next phase was deployment. We realized that as we created these multiple servers and clusters, we wanted an easier way for people to deploy their code. In come Jenkins and Liquibase. For those who haven't used Jenkins, it's continuous-integration software. It lets you deploy a given piece of code from your repo, from source control essentially. We use Stash at Cvent, which is a pretty common repo these days; Jenkins also connects with GitHub, if that's what you're using, and I believe it has plugins for TFS, for people who still use Team Foundation Server. It lets you deploy code in an automated fashion if you want to. Since we have about 45 sprint teams working on different things, each team can schedule their own job: every day at 4 p.m., deploy whatever has been worked on to the development server. That way, when they come in the next day, the development server is running the latest branch, the latest code everyone on the sprint team has worked on.

We also use Jenkins to deploy database objects. One thing Jenkins additionally allowed us to do was to isolate the service passwords from the service owners. The service owners are the primary consumers of Postgres, and they use a specific login; the password for that login is stored in encrypted form in Jenkins as a token. In the deployment job it just looks like a standard string, which gets replaced dynamically when you run the job. That way, the configuration is hidden from people who don't have rights to look at it, and it helps us keep the passwords safe.

Behind the scenes, Jenkins runs Liquibase. Liquibase is essentially an open-source, database-independent library for tracking and managing database schema changes. One of the more important reasons we chose Liquibase is that it lets you define database changes as changesets and also lets you roll those changesets back. A typical sprint cycle for us lasts about four to five weeks, and you're creating deployment scripts and tables throughout it; a small changelog sketch follows below.
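To make the changeset-plus-rollback idea concrete, here is a small sketch of a Liquibase changelog; the table, ids, and author are invented for illustration:

```xml
<!-- One changeset with an explicit rollback block. -->
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
        http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.5.xsd">
  <changeSet id="42-add-attendee-table" author="sprint-team">
    <createTable tableName="attendee">
      <column name="id" type="bigint" autoIncrement="true">
        <constraints primaryKey="true"/>
      </column>
      <column name="email" type="varchar(255)"/>
    </createTable>
    <rollback>
      <dropTable tableName="attendee"/>
    </rollback>
  </changeSet>
</databaseChangeLog>
```

Applying the changelog is just running the Liquibase migrator with its update command; pointing the same run at a rollback target undoes the changesets, which is essentially what the Jenkins job drives for the sprint teams.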
So if you've created multiple objects over those four or five weeks on your development server and you realize, oh, I screwed up completely, this isn't how I want my design to be, you can roll back to a previous changeset very easily by running the same Jenkins job pointed at a different changeset, a rollback changeset that says: bring my database back to how production looks today. And it's done in ten minutes. No DBA is involved, no administrator; it's just the people who own the service, or who are working on the development team, doing this. This was pretty important for us, because the database team is relatively small, and trying to support those 45 sprint teams across the globe at odd hours was getting very tricky. So this helped us go a long way.

This is how the Jenkins job that does a Liquibase deployment looks. It's a single job. It takes a parameter called environment, which lets you pick whether you want a production deployment or a deployment to your lower development regions; it takes the repo and a branch; and it takes the version of the changeset you want to deploy at that point. The Hogan config branch is essentially central storage for all our configuration; it's another in-house-built tool that lets you put all the configs in one place and generate them as needed. Behind the scenes, the job is just running the standard Liquibase migrator, essentially running Java code, but it replaces the passwords and such on the fly before it runs, so that whole process is encapsulated, completely hidden from the user running it. And, like I said before, it lets us be completely hands-off: the only time we get involved is when there are problems in production.

For the next section, Yuen is going to show how we do backup management and the other database tools. Thank you, Kunal. So, because we're the database team, backups are very important to us. There's a saying that your data is only as good as your last backup, so we take backups very seriously. As you can see, we have two major data centers: one is production, the other is the lower-region and disaster-recovery data center, where we replicate our data from production to the DR data center. Each data center has a backup server where we store the backups and the archived WAL logs. The backups are stored in a backup folder, and we rsync the backups between the two backup servers so that each backup is persisted on both. Because we don't want to keep too much data, we have a retention policy set so that backups older than seven days get purged. And on the production side, because we have a policy of keeping older data, there's also an archive folder that stores the older backups.

We use Barman for our backups, and there are many benefits that come with it. First, you can back up from a slave, so you don't have to worry about the backup spiking CPU or memory on your production server. Second, it can stream the WAL backups using pg_receivexlog, which minimizes your potential data loss. Third is incremental backup, which is a very important feature for us because we have some very big databases, some of them over 900 gigabytes. A file-system-level backup of those would take a while, but with incremental backup it's only 30 minutes, and the backup size is only seven gigabytes for a 900-gigabyte database. A sketch of a per-server Barman configuration along these lines follows below.
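Here is a sketch of the kind of per-server configuration the Chef automation might generate; the server name, hosts, and users are invented, but the option names are Barman's:

```ini
# barman.conf entry for one database server (illustrative values)
[pg-prod-db01]
description = "production cluster pg-prod-db01"
conninfo = host=pg-prod-db01 user=barman dbname=postgres
# stream WAL via pg_receivexlog to minimize potential data loss
streaming_conninfo = host=pg-prod-db01 user=streaming_barman
streaming_archiver = on
# rsync backups over SSH; hard-link files unchanged since the last
# backup, which is what makes each new backup incremental
backup_method = rsync
reuse_backup = link
retention_policy = RECOVERY WINDOW OF 7 DAYS
```

Recovery, as described next, is then a single command along the lines of `barman recover pg-prod-db01 latest /var/lib/pgsql/9.6/data`.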
Also, it has automatic retention control, so you don't need to worry about writing your own script to delete old backups. And in addition, it's very easy to perform recovery: you can run one simple command from your backup server, and it will restore your selected backup to a given server, to a recovery point of your choice.

During the process, though, we noticed some issues with it. First, your backup server must have connections to port 22 and to port 5432, the port Postgres runs on, on the database servers, which can sometimes be a concern for the security team. Also, I noticed that it does not support partial point-in-time recovery, which means you cannot delete your WAL logs without deleting the full backups they belong to; you have to delete them all together.

We also automated the Barman setup via Chef, so that whenever a new database server is created, the backup server finds the newly provisioned server and creates the Barman config files and backup folders for it. It also checks the backup status of all the servers every day and sends an email to our database team if any backup has been failing, and why it's been failing. We also record each backup's file size, name, and duration for reporting purposes.

There's another saying, that your backup is only as good as your last restore, meaning it's not enough to just have the backups: you have to verify that your backup files are valid, not corrupted, by actually restoring them. We do that with a nightly cron job that takes the latest backup of each of our production servers and restores it to a dedicated restore server, and it sends us a notification if it finds that any recovery has been failing. A sketch of that idea is below.
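A minimal sketch of that nightly verification job, assuming it runs on the backup server; the host name, paths, and mail address are invented, and a real job would also start the restored cluster to prove it actually comes up:

```bash
#!/bin/bash
# Nightly restore check: recover the latest backup of every server
# onto a dedicated restore host and alert the DBA team on failure.
set -uo pipefail   # no "-e": keep checking the rest on a failure

RESTORE_HOST="pg-restore01"
DEST="/var/lib/pgsql/9.6/data"

for server in $(barman list-server --minimal); do
  if ! barman recover --remote-ssh-command "ssh postgres@${RESTORE_HOST}" \
        "$server" latest "$DEST"; then
    echo "Restore verification failed for ${server}" \
      | mail -s "Barman restore check failed: ${server}" dba-team@example.com
  fi
done
```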
Another important thing for us is monitoring. To our database team, the databases sometimes look like patients and we're the doctors: we need to keep monitoring their health status in order to diagnose any problems and react to them properly. The three major tools we use are Datadog, Splunk, and pgBadger, and for most of them the installation and configuration are done via Chef, so there's very little human intervention. Datadog is kind of like a comprehensive exam of the database server: it provides lots of system-level metrics as well as Postgres-level metrics, and as Kunal said before, it supports custom metrics. So not only can our database team specify, say, long-running-query metrics; the application teams can set up their own custom metrics and alert on them when something goes wrong. Splunk and pgBadger are mostly for analyzing the Postgres error logs. A very good feature of Splunk is search: it's very difficult to manually look through your Postgres error logs, but with Splunk you can easily search for the keywords you're interested in, and you can also set up very good graphs and alerts through it. With pgBadger, the good thing is that it comes with many predefined graphs, including connections and sessions, checkpoints, temp files, vacuums, long-running queries, et cetera, so we don't have to design those graphs ourselves. There's another tool we sometimes use called pg_activity; it's kind of like the top utility in Linux, showing the sessions running in Postgres at a given moment, ordered by CPU or memory.

This is our Datadog dashboard. You can see many system-level metrics, including CPU, I/O, et cetera, along with many Postgres-specific metrics, like the last one, rows fetched or returned by your queries. There are a lot of lines because each color represents one server, so what you see here is an aggregate across all of our production servers; that way you can easily spot a particular server acting abnormally by comparing it to the others. And this is our Splunk dashboard, again an aggregated graph across all of our production servers, showing the slow queries over time; you can see that at around 12 p.m. some of the servers have a lot more slow queries than at other times of the day.

Even though we already have four third-party database monitoring tools, we still found ourselves with a problem: we have so many database servers that we'd sometimes lose track of which database instance was hosted on which server, and what databases lived within a given instance. That's why we came up with the idea of a database inventory, which collects statistics for every database instance: OS-level settings like IP, OS platform, memory, and CPU; instance-level information like databases, tables, indexes, and logins; and the backup history. It currently supports SQL Server, Postgres, and MySQL. Because it uses the Chef API via knife, it can automatically discover newly provisioned servers; no human intervention is needed to add a new server to the inventory. Another good thing about it is that it supports wildcard search across all of the objects. Maybe you've forgotten where a database resides: it would be difficult to log in to all 40-odd database instances and hunt for it, but here you can do a wildcard search and find every instance that has a database of that name. We also link application owners to their specific databases, so that if we want to do maintenance, we don't have to go through the trouble of finding out who to notify; the tool can automatically notify the application owner about an upcoming maintenance activity.

This is the UI of the database inventory. You can see it shows all of our database instances, and you can filter them by hostname, platform, and other OS-level information. There's also a search box where you can search for a specific keyword of interest; it searches logins, tables, indexes, and databases, as well as the database instances themselves. And if you click into one of the instances, you can see the databases residing on it, along with a graph of the largest databases, the owner, and the created date.

The other thing we do as a database team is maintenance of our databases, and there are three major pieces. One is performing major version upgrades, and the tools we use most often are pglogical, pg_upgrade, and dump-and-reload. With pglogical, you can replicate the data across different major versions, and at cutover time you simply drop the replication and point your application servers at the new version of the database server, something like the sketch below.
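A sketch of that pglogical flow, using pglogical 2.x's SQL-level API; the node names, DSNs, and versions are placeholders:

```sql
-- On the old (provider) cluster; assumes wal_level = 'logical' and
-- pglogical in shared_preload_libraries.
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
         node_name := 'provider_94',
         dsn       := 'host=old-db port=5432 dbname=app');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the new cluster, running the newer major version:
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
         node_name := 'subscriber_96',
         dsn       := 'host=new-db port=5432 dbname=app');
SELECT pglogical.create_subscription(
         subscription_name := 'upgrade_94_to_96',
         provider_dsn      := 'host=old-db port=5432 dbname=app');
-- Once the subscriber catches up, cutover is just repointing the
-- application: minutes of downtime instead of hours.
```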
pg_upgrade is also good because it can do an in-place upgrade without needing a dump and reload of the data, but it comes with some catches you'll want to test for; for example, it requires the new and old versions of the database to be on the same volume. The other option is dump-and-reload, which is kind of a brute-force approach; we normally do this for small databases, where we just stop the application, dump and reload the data, and then point the application at the new database.

The other maintenance task we do is refreshes. At the end of a sprint, the development team may decide that they don't want the old database and would like a brand-new one with no data in it. There are two main ways to do that: one is to wipe out the entire server and re-provision it with the Chef cookbook, and the other is to recreate the database. Recreating the database is easy for our database team, but we wanted every development team to be able to do it on their own, so we created a Jenkins job where they can just specify the server name and the database name and get the database recreated with our standards baked in. That way we don't need to worry that some of the databases were created without features we want.

Another thing is failover. If you want to fail over one single server, it's easy: you just run one failover command, or touch a failover trigger file somewhere. But we have 40-plus database instances, and failing all of them over at the same time would be a lot of trouble. So we created a shell script on a centralized server where you can just run one simple command; it reads a list of all the servers in our production and fails all of them over at once, so you don't have to log in to each of them and run the failover command yourself.

That concludes our presentation. Any questions? One question, sure: the DBInventory tool, does it pull the information from your databases live, or is it a stored snapshot? So, the tool currently runs every 12 hours, I believe, because we didn't want it running continuously. It's mostly used for capacity planning and for looking into the details of a database as needed. DBInventory is open to all developers, as opposed to being restricted, so anybody with network credentials can log in and look at what the databases are. We run it every 12 hours right now; it could be run more often, but we haven't had a use case where we needed it fresher than a few hours. Is that it? Thank you, guys.