So I'll go ahead and get started. Thank you everybody for coming today. My name is Jennifer Charelle, and I'm going to talk about a project I have been involved in for the last ten years, working with energy efficiency data here in California. This is a project that started out on Microsoft SQL Server, and we have been migrating our data processing to Postgres, so I'm going to give you a rundown on our experience with the migration and how it's worked out.

Just a road map of how the talk is going to flow: I'm going to introduce the data problem, briefly introduce myself and my background, and introduce the project, the data, and the data requirements we have been addressing. Then we'll get into the migration itself: what the key motivations were, the roadblocks we had to resolve, and how we figured those things out. And then we'll wrap up with some lessons we learned in the process and what we're looking to do going forward.

I founded my consulting business, Sound Data, ten years ago. I'm based in Seattle, but no one else on my team is in the state of Washington; I have three people in Australia and someone in Colorado, so we're a pretty classic small remote team. We have been specializing in energy efficiency data for the last ten years. One of the questions I usually get when I talk is: why does a consulting group that specializes in energy efficiency data have a fish for a mascot, or a logo, I should say? The reason for our logo is that my background is actually in fisheries. I have a master's in fisheries, and before I started Sound Data I spent ten years building aquatic ecology databases at the University of Washington, so when I first started Sound Data I was heavily steeped in fish. It's also that, a long time ago, when I was fishing in Glacier Bay, Alaska, I caught one of these little spiny lumpsuckers that is our logo, and they have been a personal favorite ever since.

But getting back to business: over the last ten years at Sound Data we have done a wide variety of projects. We have done multi-year, large efforts where over a million dollars was invested in data processing; we've done rapid prototypes where the entire project started and ended in a month; and we've done most everything in between, so we have a lot of experience with projects of different scope. We've also used pretty much every relational database software out there. Well, not every one, but all of the main ones. As an example, the Hydra project (these are just screen grabs off my website, if anyone wants to go look at any of them) was our first project. It went online in 2008, it has a Postgres database back end, and it's a website for fisheries researchers doing acoustic telemetry research to share data and have their data processed. Hydra started as a grassroots collaboration between a handful of scientists in Seattle ten years ago; today we have more than 200 researchers participating all across the West Coast of North America. Hydra is very much my poster child of what a great project can be and how efficient a project can be. In the ten years that Hydra has been online, it has cost less to run and develop than any other project I have worked on.
But Hydra's not actually what we're here to talk about today; this is a migration talk, and Hydra started in Postgres. The reality is that there are a lot of projects out there today where the data is already running in a relational database system other than Postgres, and for those projects the path to Postgres is migration. So this is the project we're going to focus on today, my migration project. It's called the California Energy Data and Reporting System, which goes by the acronym CEDARS. I started working on the team ten years ago. CEDARS is a project for the California Public Utilities Commission (CPUC). They're based in San Francisco, and they are charged with ensuring that the energy, communication, transportation, and water industries are regulated on behalf of the California public. I work in the Energy Division of the CPUC, and within Energy Division I work specifically in energy efficiency. The CPUC authorizes, every year, and has been doing this for many years, $1 billion to be invested in energy efficiency programs in this state. It's a huge investment, and the CPUC authorizes those budgets to program administrators, who go out, run the programs, and do the energy efficiency work. The CPUC requires that the program administrators have cost-effective portfolios, so that for every dollar they spend to run their programs, the people of California get back at least a dollar in energy savings. The main program administrators we have are the four large investor-owned utilities in the state of California: PG&E, SoCal Edison, SoCal Gas, and San Diego Gas and Electric. Those four utilities dominate this portfolio, and when I first started ten years ago they were the only program administrators. Since 2012 we have begun adding small regional and local community energy networks as program administrators. They run much smaller budgets and tend to target different programs and different people. So where we initially had four program administrators, we now have nine. This whole world of CPUC energy efficiency data reporting is filled with acronyms, and I've been in it so long that sometimes I don't even hear myself say them. The main ones I can't stop myself from using are CPUC, instead of saying the whole California Public Utilities Commission, and PA for our program administrators. So when you hear me say PA, I'm just talking about the various groups who actually run the programs and do the work in the state. The thing to keep in mind is that utilities in California make money by selling energy, so they have no intrinsic incentive to pursue energy efficiency; that's lost revenue to them in lots of ways. What happens instead is that if the program administrators hit the savings goals that are set for them, and they have a cost-effective portfolio, they get a pot of money at the end: a financial incentive. Those financial incentives are in the millions of dollars, generally up to twenty-some million dollars per year that a utility will get for having done a good job with its energy efficiency portfolio. It's also worth considering what the programs are that get run. At any point in time, and this has been pretty steady for the entire ten years I've been involved, there are about 400 programs being run for this billion-dollar-a-year total investment.
These programs vary widely. Some groups will go out to Costco and arrange with them to buy down the price of their most efficient LED light bulbs, to incentivize customers to buy those most efficient choices, install them in their house, and save energy that way. There's also a low-income solar program, where low-income households can have solar installed at no cost to them and get the benefits of that installation. So there's a wide variety of programs, and as I said, we run pretty steady at about 400 programs in the portfolio. When I first showed up in 2009, honestly, it was a big mess. At that point they had data in hand from the programs that had been run from 2006 to 2008, so we had three years of program accomplishment data. The CPUC refers to these data as claims, and that's the term I'm going to use in this talk. The claims data are where the PAs report to the CPUC: this is where we went, this is what we did, how many of them there were, and how much energy was saved by that effort. So we had three years of data in hand and more coming in; claims data are submitted to the CPUC five times a year, so we were kind of drowning in our data stream. The other really critical issue when I first arrived was that we had no data specifications. Program administrators were told "give us your data," and that was the extent of their instruction. What you can probably imagine happened is that we got 400 different data reports. The big utilities, PG&E and SoCal Edison, run about 100 programs each, and in one case they gave us 100 different Access databases, one for each program, and said: there are our data. We were like, oh shoot. So we spent a tremendous amount of energy taking 400 different reported data sets and compiling them into a single data set. Right as I joined the team, they had just agreed to get their first database server, and they spun up a Microsoft SQL Server database server. There wasn't a formal evaluation of what would be the best tool; it was really that this was a team of people who worked in Windows, and choosing the Windows database felt most comfortable and efficient. The data input and output at that point in time had been agreed between the PAs and the CPUC to be exchanged as Access databases. That was not a great solution, because when we first reported out the compiled data sets, we had to break them into about seven different Access databases because of the two-gigabyte limit on the size of a single Access database. So if you wanted to pick up the data we were posting publicly, you had to pick up seven Access databases and then compile them back together yourself if you wanted to be able to look across the entire data set. It was far from ideal, but that was where it was at the time. Once we got caught up with the data stream, which took about a year, we started to improve what we were doing. We defined a data specification and said: you need to tell us your program claims, and your data needs to look like this. It was a pretty simple data set that we started with; it was a flat file, not relational tables, and it very much came out of the work we had done for three years to take those 400 different data sets and bring them into one. We couldn't have done anything else to drive this forward: you can't begin to add automation or do change management on the data when you have no specifications.
So that was a critical step. We've also iteratively improved that data set every year since it was first defined. In about 2012 we moved to a relational table set, and as various requirements and decisions have come through the CPUC, we've continued to expand and refine it. We also had some serious performance issues that we had to keep working away at as we went. The first time we ran the full data set through the complete processing, there was one step that by itself took four days to run. So it was tough. It was really tough to get the data through and make it happen. There were also some integrations that needed to happen and some security that needed to be addressed. There are confidential data in the claims data we receive, namely the names, addresses, phone numbers, and emails of the participants, and I'll get to why those data are important a little later. So we were working along on our Microsoft SQL Server database server, things were being iteratively improved, and then one fateful day, the incident happened. This was a day when the data team was crunching really hard on a deadline, and the deadline had actually just passed, so we were already late, and the server crashed. It was certainly not the first time that database server crashed, but this time it took five days to recover, and the tension that existed during those five days was intense. That was the moment when everyone stood back and said: this isn't working. It really opened the door for us to propose and execute a migration to Postgres. So you have a feel now for the claims data in the energy efficiency portfolio that the CPUC runs in California. Of course, the claims data in CEDARS are not operating in a vacuum; there are two external systems the claims data are required to integrate with. The first one is called DEER, the Database for Energy Efficient Resources. That database has a variety of defined values for things like building types or building vintages, the different things we use to determine how much energy was saved by the work that happened at a site. We need the program administrators to use the values from DEER when they report their data to us, and when we started out, we just told them to do that. We said, make sure the values you give us are DEER values, and they said, oh, sure. Then, down the road, some people would manually pull DEER down, grab the claims data, put them side by side, and try to figure out whether they had really given us DEER values. A tough way of doing that. The other external system the claims data have to interact with is called the CET, the Cost Effectiveness Tool. It's a calculation engine: you take the data that's given to us in the claim and run it through a database that then outputs how cost effective that work was. A cost-effectiveness value of one means you return a dollar of value for every dollar invested; that's the threshold they have to meet. We hope that generally people get cost-effectiveness metrics of 1.5, 1.8, something higher than just one, but one is the threshold. When I mentioned the step on the last slide that was taking four days to run, that step was running data through the CET. And that four-day process was not even the entire data set: we chunked the database into pieces and sent them to different teams, and the team with the biggest chunk took four days to run it.
Oh, that's a great question. We get about a million claims a year when we run the full data set at the end of the year. For 2006 to 2008, which the CPUC operated as a three-year cycle (they have since moved to a single-year cycle, so now we only do one year at a time), we were running three years of data, so it would have been about three million records.

So as we got to this moment where we were ready to think about migrating our system, we were also dealing with the fact that we needed to make this integration no longer manual. People would pick up the claims data, place them into the CET database, execute the run-CET command, wait, get the data back out, and send it back to the central database so it could be compiled. There were so many manual steps at every junction along the way, and this is the best way I came up with to illustrate it. This is a very high-level flow diagram showing our program administrators, the PAs, submitting their data to the Microsoft SQL Server database we call EDCS, the Energy Division Central Server. On there we would compile and QC it; you would need to bring the DEER data in to do the QC work, then send the data to the CET to be run through the calculation engine, and then compile the output. Once all of that was done, the data would publish out both to Energy Division staff, for them to write reports and summaries and do analysis, and to the public, because we have a mandate that all of our data are publicly accessible except for the confidential data. At the end of our first run with the 2006-2008 data, the process to get from one end of this diagram to the other took four years for some of the data. We were dealing with three years of data at that time, so it was two to four years: from 2006 to 2008 the program administrators gave us their data, and we published it in 2010. We were looking at a four-year latency, and that was a really big problem for the CPUC. A lot of people look at the work they're doing and the data they're publishing, and people were upset that they had to wait so long to see these data. So the question became: how do we make our energy efficiency data more efficient? For this particular project, three key things needed to happen. First, we needed to fully automate our data processing. At the worst of it, the same data came in five times a year, and the QC feedback wouldn't get back to the program administrators until after they'd already submitted their next data set; at that latency, it was taking up to six months for program administrators to get QC feedback when they were required to submit data five times a year. So the QC feedback wasn't being acted on; they'd get it and be like, wait, that's not even our last submission, whatever, and just keep going. Things had really deteriorated. What we wanted to do was give our program administrators real-time QC feedback, and we wanted proactive quality control, so that data didn't come into the system unless they passed the rules. Up until that point, we took all of the data they gave us, figured out which records violated the rules, and set up a meeting with the PA, a WebEx or a GoToMeeting, to show them the problems, and then they would resubmit their data to us and we'd do it all again. Not very efficient. We also wanted to eliminate the manual steps in the integrations against the other servers.
The effort to reconcile whether the data had really used the right DEER values was many humans working for many months every year, and they still couldn't do a particularly good job of it because of the way it was set up. And the big piece was to migrate. We wanted the automation, we needed a web application, and to build an efficient web application we needed to move off of Microsoft. The Energy Division was also really struggling with the expense of the server, particularly given that it crashed regularly and wasn't performing as we would have liked. We went through a stretch of time where we took turns using the server because we were crashing it too often. I would have to wait for the people upstream of me to finish their pieces, then I'd get the message and have the server until I finished my piece, and then I would ping the next person down. You can only imagine how furious that makes people; being asked to take turns on a server is ridiculous. You should stop calling it a server at that point. Anyway, moving on. When we got to the point of doing this migration, there was a chance to do a more traditional evaluation of the relational database software available to us and which one was the best fit for our project. For me, the key factors to look at were these. Database administration overhead: Microsoft SQL Server has pretty high database administration overhead, particularly compared to Postgres. Development expense: I can develop a database in Microsoft SQL Server about as efficiently as I can in Postgres, but the web application options for the two are completely different in efficiency, and we were going to be doing a lot of web development for our automation; we didn't want expensive Microsoft ASP.NET web application development, we wanted to be efficient, open source. It also comes down to performance, which I'm sure you've realized was an important consideration for us in this project after the incident. Then integration. It's funny, because we had two external servers we needed to integrate with: one is Microsoft SQL Server, one is Postgres. So the integration wasn't going to drive the decision, though I would say it's easier having the claims data in Postgres and integrating with the external Microsoft SQL Server than I think it would have been the other way around. And finally, tools. Honestly, at the point when we were doing this migration, there weren't any particular tools we needed that weren't present in both SQL Server and Postgres, so that wasn't a very compelling factor in this particular case. But I'll get to later that once we were on Postgres, we did find tools there that we would never dream of giving up now that we have them; they just weren't a motivation at the time. As far as development expense, as I said, I don't really consider SQL Server hugely slower to develop in than Postgres. Oracle is the other option; I've built Oracle systems, and I would generally say I could build the same functionality in Postgres in about half the time it would take in Oracle. Oracle is powerful, but nobody was really interested in moving to Oracle. So, migrating to Postgres: what we stood to gain. Sorry, I promised myself I'd try not to talk about this, but just very briefly: when the Microsoft SQL Server was first spun up
in 2009, there was a decision made at that point to also use the database server as an SFTP server. Right? Thank you for laughing. So we had told them that we had to stop doing this: the SFTP had to be moved onto its own server, and the database server had to be used as only the database server. The extra insult of the SFTP being added was that it gave us probably ten times the users on the database server that we would have had otherwise. We had a handful of people who actually needed to be on the database server building things in the database, and we had 35 to 40 other people who remote-desktopped onto our database server to use it as an SFTP and browse through files other people had posted. So getting that off was a big upgrade, if you will, that we were able to do at the same time as the migration. User access: as I said, we needed to get all the SFTP people off of the database server. And for context on size and scale, when we moved the SFTP off of the database server, we had three terabytes of files in it. It's a very expensive way to operate, sticking a drive onto a database server and filling it full of files. Anyway, the other problem we had before we migrated to Postgres was that Energy Division at the CPUC could not directly access their own database server because of the CPUC IT firewall. IT at the CPUC would not open the firewall so that the staff there could directly interact with this database, and I was embarrassed to say, for a number of years, that the only way my clients could use their own database server was to call me and tell me what they wanted, and I would email them whatever they had asked for. It drove me absolutely crazy, and we wanted the CPUC to have direct access to their own data. As far as automation, we had been driving automation forward the entire time; every year we added more automation and removed manual steps, and we got from data processing taking a year to data processing taking two weeks, and this was before the migration. But we needed to finish automating the system; all of that helped, but it didn't get us where we needed to go. We were also looking to gain some transparency. The relationship between the CPUC and the utilities it regulates carries a lot of history; there was a lot of confusion and lack of transparency. The standard back in the day for a data spec was that you would make it by hand in Excel and send it out to people, and there would always be mistakes in it, somehow something would get mismarked, so there was a lot of frustration. Performance we talked about: we needed to get things down to minutes where they were still taking weeks. And then reporting: we needed to get the data out to our audience quickly, because our latency was just unacceptable. So the CPUC agreed to migrate. I showed you that really high-level flow of claims earlier, with just the few boxes; this is a more detailed version of that same problem space, but this is the flow for the new tool we've created that runs on the Postgres database. If we start here, that's the specification. What we did for this project is we wrote the data specification in a human- and machine-readable format. It's posted on the website as the specification that all of the program administrators can read to understand what they are required to do, and it's also literally what our web application reads to build out the CEDARS data processing engine.
One of the things we're required to do in the system is update our specification once a year. There are different decisions made by judges, different things that happen. One example is that recently we started adding water savings: before, we had only tracked electricity and thermal energy savings, but as the water-energy nexus has become a big deal in efficiency, now we have a water table as well. So we made this specification the heart and soul of this application. Our program administrators then upload claims that meet the specification, and CEDARS in real time says: thank you for your submission, you submitted 500,000 records; 450,000 records passed and have been accepted into your claim portfolio; 50,000 failed, these are the failing records, and these are the things that were wrong with them. Then they can go back, fix those records, resubmit the ones that erred on the first try, and get their data set to be whole. The program administrators have the ability to upload as many claims as they prefer; it really depends on their internal systems, over which we have no purview. For that example where they gave us 500,000 records and we rejected 10 percent: if the user prefers, they can fix the failed records and re-upload the whole 500,000 a second time, or they can just resubmit the 50,000 that failed. We don't care; CEDARS is happy to take the data either way. And with this all fully automated, we gained the ability to track these events and keep histories and logs of all the uploads that happen. When the program administrators are happy with the set of uploads they've done in a particular submission, they confirm it. Once they confirm that submission, we send the data over to the CET, run it through that calculation engine, get the outputs back, and update all of our dashboards. Those data become immediately available to the public and to Energy Division staff through the variety of downloads that we provide. The largest claim confirmation that we have in the system will now run through this entire process in about an hour. So we've gone from two to four years at the beginning, to months, and now we do the entire process in an hour. It's a huge step forward for everyone, and for the ability to get data out to the people who want to see it in a timely way. This is the actual page in CEDARS for the claim specification, and CEDARS is a public website; you can go there if you're looking for data to play with in your database. We have an awesome data set publicly available, so definitely go grab some. I'm looking at you, Joe. This list here: these are all the value lists that we get from our external system, the DEER database. We have an API, and every night we go and refresh those value lists from their sources. As I said before, we've got things like the building type and building vintage, and lots of other acronyms that don't need to be translated. The other side of the specification is what we call the source of truth, and that's really the heart and soul of the application, as I mentioned before. The core of that is the claim spec SQL document, where we define the tables, the fields, the data types, and all of the single-field validation required for the data submission.
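To make the idea of a machine-readable spec concrete, here is a minimal sketch of the kind of single-field validation such a spec compiles down to. The table and column names are invented for illustration; they are not the actual CEDARS schema.

```sql
-- Hypothetical sketch only: one table's worth of single-field validation,
-- the kind of thing the spec document defines (names are illustrative).
CREATE TABLE claim_measure (
    claim_id         text    NOT NULL,   -- "required" flag in the spec
    program_year     integer NOT NULL,   -- data type enforced by the spec
    building_type    text,               -- value-list field: must match a DEER value
    incentive_amount numeric CHECK (incentive_amount >= 0)  -- numeric range rule
);
```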
We then have a set of validation rules for the more complicated logic; there's a lot of "if this field is that value, then this other field can't be this," and we needed those more sophisticated rules. We also have a set of warnings that follow the same model as our validation rules. We have a read-me, but no one reads it; I don't know if anybody knows the solution to that, but I haven't figured it out. We have a metadata file, we have an example good data file that people can take to start off from, and then we have a change log where we document every change made to this specification since CEDARS was created. If we jump to the claim spec, the SQL document, this should look pretty familiar to most of you DBA folks: we define each of the tables, the fields, the data types. We invented a few things for the web application to be able to do the QC that we required, so we have things like fields that have to start with particular strings (I don't know how often anybody else has to do that), and flags for whether or not a field is required to be filled out. You can see, as we get down towards the bottom, that we also use standard mathematical notation to put limits on our numeric values. If you look at these fields right here, there's a number range that says 0 to 30. This is the estimated useful life of whatever energy efficiency measure was installed, and there's a limit in the regulations for these programs that it can't be more than 30 years, so we enforce that no values over 30 years are submitted for those fields. So we have a nice suite of QC rules. One of the other pieces that was obviously really important for us was the value lists: we have value lists from DEER and the CET that we have to enforce, and the spec tells you exactly which fields have value lists and whether each field is required, because some fields have value lists but are not required to be populated. This is a very simple example from our validation rules, which is where we define all of the quality control rules that aren't just single-field rules and so can't go into the SQL spec. This example says that in the program cost table, the year has to match the quarter: if they tell us this is a program cost from 2019 Q1, the year they give us for that record has to also be 2019. Here's a more complicated example that gives an idea of how we approach this. The gas savings profile is a field required by the CET to determine the cost effectiveness of the data, so if there are therm gas savings for any particular claim, they have to give us the gas savings profile, unless it's a codes-and-standards claim. I won't get into exactly what codes-and-standards claims are, but they're a special subset of the claims that come into our system, and they get excused from some of our data rules. We set this up so that we have a "where" connector here, which gives us a tremendous amount of flexibility in defining sophisticated and nuanced rules for different situations.
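As a hedged sketch, the two rules just described could be expressed as plain SQL checks like the ones below. The table and column names are invented for illustration, not taken from the real CEDARS spec; any rows these queries return would fail validation.

```sql
-- Rule 1 (hypothetical names): in the program cost table, the reported year
-- must match the year embedded in the quarter label, e.g. 2019 for '2019Q1'.
SELECT *
FROM program_cost
WHERE year::text <> left(quarter, 4);

-- Rule 2: a gas savings profile is required whenever therm savings are
-- claimed, unless it is a codes-and-standards claim.
SELECT *
FROM claim
WHERE therm_savings <> 0
  AND gas_savings_profile IS NULL
  AND claim_type <> 'codes_and_standards';
```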
This diagram is probably impossible to read even in the front row, but I just wanted to include it. What it shows is the CEDARS data processing here, and this is the CET. For the CET we set up an API that runs on demand: when someone confirms a claim in CEDARS, that data is sent across to the CET server, which is Microsoft SQL Server; those data get picked up, loaded into the CET, and run through the calculation engine; the outputs are returned and dropped into an output directory; the CET server pings CEDARS and says, I finished job 432; and CEDARS comes and grabs the outputs and processes them into the CEDARS database. So this is fully automated; there are no manual components whatsoever. You can see it's not simple, there are a lot of steps along the way, but we have this running, and the biggest data submission runs through this entire process in less than an hour now. When we did this migration and brought our data model from SQL Server to Postgres, I wrote our new Postgres data model by hand. If I were going to do that again, I wouldn't; I would use pgloader. I have nothing to do with pgloader, I'm not on their team; it's just a fantastic tool. With pgloader, in a single command, you can migrate SQLite, Microsoft SQL Server, or MySQL databases to Postgres. pgloader goes to the external database, reads the data model, builds that data model in Postgres, and then moves all of the data across for you. An unbelievable tool. The one thing that hasn't yet been done in our migration is the CET tool, which is written in Microsoft SQL Server and hasn't been brought across; it's the one thing that still makes us run a Microsoft SQL Server machine. When we get to the point of doing that, in the next year or so, I am definitely using pgloader. The CET is a really complicated calculation engine. It was written by people who were not SQL programmers, so if you go into it and try to trace the data through the process, it ends up looking a lot like a pile of spaghetti. I inherited the CET a few years ago, and it was pretty intimidating to think about actually migrating it, until I realized I could just use pgloader. I recently installed pgloader on a production machine for a different project where we're migrating a SQLite project to Postgres, and it's been crazy easy: we had two months to do it, we did it in a week, and now we're all standing around looking at each other. And all you have to do to install pgloader is run that one command, which is brilliant. So this is my plug for pgloader.
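For flavor, that one command looks roughly like this. The connection strings are placeholders, and pgloader's documentation covers the exact URI forms for each source database.

```shell
# Sketch: one-command migration of a SQL Server database to Postgres.
# pgloader reads the source schema, recreates it in Postgres, and copies the data.
pgloader mssql://user:pass@mssql-host/source_db \
         pgsql://user:pass@pg-host/target_db
```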
Really briefly, I just want to mention that one of the things to consider in a migration project is what to do with your data stream while you're doing the migration. That's going to depend so much on your particular project, but it was one of the things we had to figure out before the client would agree to move forward with the migration: how do we keep up with the data coming in the door while doing this work to improve the system and migrate? Obviously, a huge piece of what you get by migrating from SQL Server to Postgres is that you save a ton of money. Right now we run four database servers; three of them are Postgres, one is Microsoft SQL Server. We spend more on that one Microsoft SQL Server than on all of our Postgres databases, and we get better performance out of our Postgres databases. And all four of those machines are the exact same AWS R5 instance size; it's not that they have different resources or cores or memory, they're exactly the same machine. That makes the client really happy. This is a monthly expense: this group invests on the order of thousands of dollars a month in database servers and has been doing it for ten years, so being able to offer really significant cost savings is really awesome. And I want to make clear what we really got, because this is a project where we didn't just do a migration as a standalone effort; we were doing the migration at the same time that we were completing the automation of our processing. So these are divided: these are the things we got from the migration itself, and these are the things we gained by finishing and completing all of the automation, so that we have a fully automated system. One thing to note is that it very significantly improved the relationship between the CPUC staff and the program administrators; the transparency, the accountability, and the stability of the system did a lot to bring people together and move us forward in that relationship. Better access, as I mentioned before, was a big one for me. I was very embarrassed that Energy Division couldn't access its own database server, and I love that now they can hop on the website anytime they want and download all of their data, or just get that one particular answer they're looking for. Postgres for the win: we've been incredibly happy that we undertook this migration. The database administration overhead is so much lighter compared to what we were having to spend, and what we still have to spend on our one Microsoft SQL Server database. Our development is really efficient, we're getting better performance, and our integrations are easier. I was going to mention a few of my favorite tools that we have in Postgres, but I think I'm just out of time, so if anybody wants to chat about those, definitely feel free to grab me at any point in the next couple of days. I really would say that the migration was painless, and as I said, right now we're just finishing the migration of a SQLite database, also painless. So if you are considering a migration, I would just say: no regrets, I wish we'd done it sooner. And I'd be happy to answer any questions. Oh yes, the web application is Django. It's Django, yep. Thank you very much.

Welcome, everyone. Is that any better? Now that you're all awake, hopefully. I'm Craig, here to talk a little bit about Postgres extensions today. A little quick context about me: I work at Citus Data, acquired a few months ago by Microsoft, so now I guess I work at Microsoft, where I run our database as a service. If you're not familiar with Citus, we turn Postgres into a sharded, horizontally scalable database. We're a pure extension, so if you're running into the limits of a single-node Postgres, that's where we come into play. Prior to Citus I was at Heroku for five and a half years, where I mostly ran Heroku Postgres: built and ran that team. I also curate Postgres Weekly; I encourage you to subscribe. It's a pretty highly curated list of articles from that week around the Postgres world. It's more targeted at the app-dev side, less at DBAs, so if you're not a DBA and kind of want to know what's cool in Postgres, what's new, when new vulnerabilities come out, those sorts of things, I encourage you to subscribe. So let's get right into it: what are extensions? Before I actually get into what they are, a bit of context. I'm increasingly of the mindset that extensions are the future for Postgres. Postgres has been a really, really solid, stable database for a long time. It's been around for over 20 years, and I think the number one core goal of Postgres for the longest time was: be safe, don't lose your data. Which is a pretty good thing for a database. It had a few rough edges; it wasn't necessarily as user friendly as some other databases, but in the last few years it has started to catch up a lot there. We've got things like additional index types, we've got JSONB, we've got full text search, and it's become more and more powerful. I think it's pretty well regarded as the most advanced open source database that exists today.
But Postgres moves at a release cadence of one new release every year, and things that go into the core of Postgres are held to a really high standard; they're not going to take the newest fad and just drop it in. I actually recall chatting with some of the core community about JSON, this was six or seven years ago, and one of the developers that worked on it, over in Russia, had never built a web application and never really seen JSON before, and his reaction was kind of: why do we need this? No one uses JSON, we have data types, why do we need JSON? And I remember thinking: I've seen this before. There was a time when XML databases were going to kill relational databases, so what did Postgres do? They added an XML data type. So in a sense Postgres has been a NoSQL database for 10 or 15 years, because it's had XML support, which is a NoSQL data type. I say all of this because Postgres moves at a very, very slow, steady pace. We got JSON, but it was an uphill battle: maybe it's a fad and will go away, why do we have it, does XML make sense in Postgres, maybe not, but we added it a long time ago. So increasingly, I think we're going to see more velocity outside of the core engine. The core engine is going to continue to improve and be really nice and stable, but we also want the latest and greatest, so how do we have a testbed for new features? Extensions are essentially that. Extensions are really low-level API hooks where you can hook in and change things: you can add new data types, you can change how Postgres interacts with tables. I expect a lot of the pluggable storage work will extend on this, and essentially, in time, we'll be able to have separate storage engines built just on these extension APIs. You've probably already been exposed to extensions even if you haven't realized it; a number of extensions already come with Postgres. These are the contrib extensions: in Postgres you've got pretty quick access to about 17 different extensions that are in contrib. I don't anticipate a lot more being added to contrib, because the story around external extensions is expanding; those have to be built against the source and installed, or they come already compiled for your distribution. And there are a number of hosting providers that already have support for a number of extensions: if you're on Amazon RDS, on Azure, on GCP, they each support a set of extensions that you can go and look at, and if you don't see one supported there, I encourage you to ask them to support it. We're going to see more and more of these evolve in time. On the contrib side: hstore is a key/value store within Postgres. It's still fairly useful, I see people using it today, and you can index it with GIN indexes, which are really powerful. citext: if you ever migrate from MySQL over to Postgres, MySQL by default doesn't have any case sensitivity, and if for some reason you expected that behavior in your application and need to preserve it, it's a great data type for it. The Postgres foreign data wrapper, so you can connect and query from one Postgres database to another. And then pg_stat_statements, which we'll drill into a little bit further. All of these are contrib extensions, so if you've used them or enabled them, you were actually working with extensions and didn't realize it.
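As a quick illustration of how little ceremony the contrib extensions need, here is a sketch of enabling and using two of the ones just mentioned; the table names are made up.

```sql
-- Contrib extensions ship with Postgres; enabling one is a single statement.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS citext;

-- hstore: schemaless key/value pairs in a column, indexable with GIN.
CREATE TABLE products (id serial PRIMARY KEY, attrs hstore);
CREATE INDEX products_attrs_idx ON products USING GIN (attrs);
SELECT * FROM products WHERE attrs @> 'color=>red';  -- containment query, uses the index

-- citext: case-insensitive text, handy when porting MySQL behavior.
CREATE TABLE accounts (email citext UNIQUE);
```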
But the world of extensions is growing more and more on the non-contrib side, the external ones, not shipped with Postgres. You've got PostGIS, which is huge; if you do anything geospatial, you're aware of it. Citus: we turn Postgres into a distributed, sharded database; the application still sees what looks like a single-node database, but under the covers it's split up across multiple nodes. ZomboDB is a really fun one: it allows you to have Elasticsearch indexes and use those directly from within Postgres, so as you're updating your data it's being sent over to Elasticsearch, which maintains those indexes, and if you're doing text search you can basically query directly into Elasticsearch from within Postgres. We'll drill into a few more of these. Today we're going to do a kind of whirlwind tour; it isn't going to be exhaustive on any of these, you can go much, much deeper, but hopefully it broadens the surface, and if you see one you're interested in, drill in further. So the first one: pg_stat_statements. I highly regard this as the most useful Postgres extension. You should have it enabled by default on every database you run, and you should look at it on about a monthly basis. What does it do? It records all queries that were run against your database and all sorts of stats about them. It basically takes a query, so if you have something like select star from users where email equals craig@citusdata.com, it's going to parameterize my email and say: how many times were you looking up users by email? It records how many times the query ran and all these sorts of stats. I don't understand half of these stats, and that's okay, you don't have to: how many shared blocks were read and dirtied, all sorts of internals, and there are other Postgres tools that will go and examine this at a deeper depth. What's really nice is you can write a really, really simple query, as below, that gives me the total amount of time each query consumed against my database, the average, and the query itself, so I get a really high-level look at what's consuming resources against my database. I don't have to go through and dig through all of my application code, I don't have to read through logs; I get a really high-level look: okay, this top query here consumed, in total, 295 seconds against my database, and it averages 10 milliseconds. If I'm fresh to a database, I can start here and ask: where do I want to optimize? If I know a query should typically perform in one millisecond, I can go optimize that one and get an order of magnitude of time back against my database. Super, super useful. I highly recommend that if you don't have it installed, you just install it now, and then take a look at it every month or so, every three months, just a quick checkup; it'll tell you a lot. And there are a number of queries you can adapt on top of this: hey, I want to look for anything that runs over a certain period of time and has been run at least 500 times, that sort of thing.
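Here is one version of that "really simple query". The column names below match pg_stat_statements up through Postgres 12; in 13 and later, total_time and mean_time became total_exec_time and mean_exec_time.

```sql
-- Top ten statements by total time consumed against the database.
SELECT round(total_time::numeric, 0) AS total_ms,
       round(mean_time::numeric, 2)  AS avg_ms,
       calls,
       query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```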
A small plug: if you're on Citus, we actually have an extension to this, an extension to an extension, citus_stat_statements. For a lot of the multi-tenant apps we see, you shard by something like a customer ID; if you're Salesforce.com, each Salesforce instance is its own tenant's data. What we do is preserve the tenant ID, so if you have a where customer ID equals foo, we're going to preserve that, so you don't lose it the way you normally would with pg_stat_statements. Really powerful, because you can see who your noisiest customer is, who's consuming the most resources, so that you could maybe bill them more, or go move them to their own node, that sort of thing. So, PostGIS. PostGIS is probably the largest, most advanced extension out there.

[Recording gap; the transcript resumes mid-answer in a Q&A about the stored procedure support added in Postgres 11.]

...and with transaction control, because that transaction ended up taking too long, and our application had other maintenance things to work on that table. So with the transaction control that you're allowed with stored procedures, we run that within a stored procedure. The idea is that the stored procedure is supposed to be a script, so to speak. And if you're thinking of your stored procedure as a script, it might be nice to be able to call some VACUUM statements, or even things like CREATE INDEX CONCURRENTLY; I think CREATE INDEX CONCURRENTLY specifically is the variant that you can't run within a transaction block. And finally, I think this is the last one, yes.
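A minimal sketch of the pattern this answer describes: a Postgres 11+ procedure that commits between batches, so one giant transaction doesn't hold locks while other work on the table waits. The table name and cutoff are invented for illustration.

```sql
-- Procedures (unlike functions) may COMMIT mid-run when invoked with CALL.
CREATE PROCEDURE purge_old_rows()
LANGUAGE plpgsql
AS $$
BEGIN
    LOOP
        DELETE FROM events
        WHERE id IN (SELECT id
                     FROM events
                     WHERE created_at < now() - interval '90 days'
                     LIMIT 10000);
        EXIT WHEN NOT FOUND;   -- stop once a batch deletes nothing
        COMMIT;                -- end the batch's transaction, release locks
    END LOOP;
END;
$$;

CALL purge_old_rows();
```

Note that even inside such a procedure you still can't call VACUUM or CREATE INDEX CONCURRENTLY; that's exactly the gap the question above is about.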
Finally, there's adding that transaction support to the other interpreted languages you may want to use within a stored procedure, to get more languages supported with stored procedures. Yeah, it's just about providing a way to call a commit, right. So I would say that if you don't see your favorite language up there, don't worry; it's not because there's a huge technical barrier. It will take a little bit of work, and hopefully someone will step up to getting those additional tweaks into those languages to get them working in stored procedures. Yes? The question is, how would you do error handling? There's no special error handling. It depends on your calling language. You would still use the same sort of error handling that you normally would: whether that's a try/catch in whatever language supports those keywords, or, if you were using C, for example, you would still look at the SQL error codes and whatever is thrown or raised out of the stored procedure. So there's no additional handling specific to stored procedures; it falls under the typical error handling you may or may not be familiar with. Was there another? So that's all I have for you today. I just wanted to remind folks that there will be a Postgres table in the Expo Hall, so you are more than welcome to come visit, grab some pins and stickers, and I'm sure you can find one of us to ask more questions throughout the week. Yes, that's right, the Expo Hall is in the other building. All right, thanks, everyone.

At this point, you can hear me? Sorry for the voice; I've had a bit of a cold recently, but I hope it won't prevent us from having a nice presentation. So thanks, everyone, thank you for coming. Today we'll discuss why it's really important to stay curious, especially if you're working with PostgreSQL. But first of all, a little bit of background. My name is Dimitri. I work for Zalando. I'm part of the team that runs all the databases inside the company, and there are a lot of them; for now, I guess it's 400-something just on Kubernetes, plus the data centers and everything else. So really a lot of them. And we have to run PostgreSQL in really quite different environments: in our data centers for the rather legacy databases, directly on AWS, within Kubernetes, and so on and so forth. Fortunately, we have two open source projects that are helping us with this: Patroni, which you've probably heard about already, and the Postgres Operator, which Josh was talking about today. And recently we got accepted for Google Summer of Code, which is really nice. Please check them out, of course. Actually, this situation, that we have to handle PostgreSQL in so many different environments, is kind of the source of inspiration for this talk. Let me explain. So what happens when you run some single PostgreSQL service, a single PostgreSQL database, and you don't know what's going on inside?
I mean, normally what happens inside is pretty much covered by the pg_stat views; pretty much whatever information you can extract from those, people are more or less aware of, and everything is fine. But then at some point people realize: aha, we unfortunately run PostgreSQL within some particular operating system, some particular Linux or whatever, which means that Postgres, of course, has to interact with this system, and at this interaction layer we're already missing some information; we don't monitor something. So what should we do about this? Usually the question in their heads is: aha, let's monitor something global. Let's monitor, I don't know, CPU utilization or I/O utilization, something like that. And then suddenly we realize: aha, we're running within some particular cgroup, within some particular container, and usually already at this point people start to think that something's going wrong, we don't know what to monitor here. We've read some blog posts, we've seen the cgroup file system, but we still have no idea what to do here (I'll show in a moment how simple it is to peek at). And then it goes even deeper: we suddenly realize that we're running within a virtual machine when we run this container, when we run this Postgres. And again people are saying, oh my God, I'm no specialist there, I don't know. And then, at the end, the final layer: all this stuff was basically just a particular node within a Kubernetes cluster, and people are completely disoriented at this point. What really bugs me is that usually at this point people say: we're serious people, we're doing business here, we don't have time to investigate it, so let's just forget about it. That's really wrong from my point of view; I really don't like this approach. And the thing is that nowadays, in 2019, you don't have any excuses for that. In this particular presentation, I will try to prove the point that whenever you see something interesting, whenever you run PostgreSQL, or in fact any other application, but in this particular presentation we concentrate on PostgreSQL, because, well, as Andy Pavlo says, databases are the most important application in the world. I will try to prove the point that it's important to realize how your PostgreSQL instance, your PostgreSQL database, interacts with all those underlying systems. Unfortunately, the plan for this presentation is a little bit chaotic, so there is no plan per se. What I did here is basically collect some interesting or useful examples from our own experience, or from my own personal experience, where you basically could not solve a particular riddle or conundrum without getting some information from the outside. So the point I'm trying to deliver here is: you have to be a little bit more curious and step outside of PostgreSQL to take a look at what's going on. Yeah, if you still need some particular plan, it's probably going to be something like this: we're going to talk a little bit about strace and perf, then about procfs and sysfs, then BPF and extended BPF (eBPF), and at the end there will be a few words about the DevOps side of this approach and a few horror stories.
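On that opaque cgroup layer: the cgroup filesystem is just files, and a couple of reads tell you the limits your container actually runs under. This sketch assumes cgroup v1 paths; cgroup v2 moved these to memory.max and cpu.max.

```shell
# Memory limit imposed on this container/cgroup (bytes).
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# CPU quota: quota divided by period gives the CPUs you effectively have.
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us
```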
Yeah, but first of all, as I said, in 2019 you don't have any excuse for not staying curious. Because, of course, you've got the source code all the way down, and in fact I really encourage you to read the source code of PostgreSQL, and of Linux too. For PostgreSQL it's quite clear: the source code itself is just beautiful, it's beautifully documented, it's easy to read, and you can get a lot of interesting and nice information just from the source code. Unfortunately, with the Linux kernel it's a little bit different, but still, all the new stuff, like anything about cgroups or I/O schedulers, is documented really nicely. Plus, of course, we have the tools people are more or less aware of: strace, GDB sometimes, perf. Unfortunately, I have to say that this particular presentation has in mind only Linux-based systems; FreeBSD people have DTrace and such. So if we're talking about Linux, we also have procfs and the more modern sysfs, and we have something called BPF, Berkeley Packet Filter, and an extended version of it, which we're going to talk about in later slides. So yeah, right away, a few examples. With this particular nice example, we got into quite an inconvenient situation when we were preparing for Black Friday, and Cyber Week before it, and we had to sell a lot of stuff, and suddenly we realized that one particular database that was responsible for some promotion in the Nordic region sometimes started to show this error for queries people were running against it. It's kind of scary, and the thing is, since it's an error, you cannot troubleshoot it from within Postgres itself. You have to spin up something new. So let's imagine we're able to trace this particular error by just attaching strace to the backend. If we attach it, we will see quite clearly what's going on, because before it was kind of cryptic, and here we can see clearly that the situation was the following. We had upgraded to a rather new version, Postgres 11, where parallelism finally started to work, where all those parallel workers finally started to work to the full extent, and people started to run some analytical queries, and all those parallel workers started to be created. We can see that we were running exactly a parallel plan, blah, blah, blah, and then for every parallel worker we had to create a separate DSM, a dynamic shared memory segment. This particular one was not anonymous; you could find it at this particular path. And the problem is that this particular fresh new database was unfortunately running within a Docker container, and by default the Docker container limits the amount of shared memory to 64 megabytes, which is not a big amount for an analytical query; and in fact, a lot of people have started to complain about this recently. And all this information you can see just from a third-party tool, strace, especially if you use a modern version of strace, which has a nice switch, the -k option, that uses the libunwind library to show you the whole stack trace. In case you're not aware of it, strace is a tool that allows you to see what kind of system calls your particular process is doing, and in this case, with -k, we also get the stack trace that led to each particular system call.
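The commands behind that story look roughly like this; the PID is a placeholder, and -k requires an strace built with stack-trace support.

```shell
# Attach to a backend and print a stack trace (-k) for every syscall,
# which is how the failing shared-memory allocation under /dev/shm shows up.
strace -f -k -p <backend_pid>

# One common fix: Docker caps /dev/shm at 64 MB by default, so raise it.
docker run --shm-size=1g ... postgres
```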
You may say that's kind of a trick, because it's an error, and an error is of course complicated to troubleshoot from the application itself. That's why we have the next example. Let's imagine you run your Postgres inside AWS. In AWS you have different generations of instances, M4, M5, and so on. What's the difference between them? They run on different hypervisors. So far, that's just background. The other part is that on Linux we have something called the vDSO, the virtual dynamic shared object. It's a quite nice feature that allows us to make certain system calls without switching into kernel space. One of those is the function gettimeofday. Usually this gains us performance, because we skip the switch, which is overhead, and so on and so forth. But if you attach strace to your PostgreSQL backend on an older, previous-generation instance in AWS (a plain EC2 instance, of course, not RDS), you will see that somehow we are doing this as a real system call. We're still paying for the switch; it has not been optimized away. Using this information, we can see that there is a problem: we're losing performance. Why is it happening? Unfortunately, it turns out that not all hypervisors support this feature. In particular Xen does not, and of course the M4, the previous generation of instances, ran on the Xen hypervisor. So here, just by analyzing what kind of system calls your PostgreSQL instance was making, you could realize that you're losing performance simply by choosing the wrong instance type.
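Checking this on your own instance is a one-liner. A sketch, PID illustrative; the trick is that calls served by the vDSO never enter the kernel, so strace cannot see them.

    # Count gettimeofday syscalls reaching the kernel; interrupt with
    # Ctrl-C after a few seconds to get the summary table. With a
    # working vDSO the table should be (nearly) empty; on a Xen-based
    # M4 instance the calls show up as real syscalls.
    strace -c -e trace=gettimeofday -p 12345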
Okay, but the problem is that strace, of course, is rather limited: it just shows you system calls, and sometimes, to check performance, we need to spin up something more powerful. For example, perf. For background: perf is a tool that allows you to sample a particular process to see in which part of it the process spends most of its CPU time, or, used in a slightly different mode, to read performance monitoring counters. In this particular example I decided to perform a simple experiment. We've got one PostgreSQL, and we run a pgbench workload against it in two different modes. In experiment number one we have a long-running transaction, a custom pgbench script running something rather long, and you can see that latency was about 1,300 milliseconds; it's really a long one. In experiment number two I used this very same script 50% of the time, and the other 50% of the time something really short, basically SELECT 1 versus the long-running query. I was expecting that, since half the time we're putting less pressure on the system, the latency would go down a little. Instead I got a higher latency. So what's going on? I try to compose this presentation as: use case, theory, and then proof or disproof. In this case the theory was that the two different workloads started to be in some sort of contention for CPU resources. How can you prove or disprove it? Easily. First, a diagram of what I mean. Let's imagine we have two CPUs: on the first, a short-running transaction T1, the red one, and on the second CPU we're running T2, the blue one, the long-running transaction. And alongside, we have some sort of CPU cache. Then, for some particular reason, the Linux scheduler decides to preempt T2 to give some time to a transaction T3, and then, for balancing reasons, because of some particular algorithm built into the scheduler, it decides to migrate T2 to another CPU. The problem is that the migration procedure is already overhead in itself, but on top of that the cache has to be warmed up again on the new CPU, which is of course also overhead. And how to prove it? Super easy, in fact. You can spin up perf and check the interesting performance counters: in this case cache misses, just to see whether it's cache-related or not, and CPU migrations. And we can see that the first and the second experiment show more or less the same number of cache misses, so we can rule the cache out. But at the same time, we see about three times more CPU migrations the second time. And that's basically the proof of the theory: because of this balance between different workloads, the scheduler was switching those backends back and forth between CPUs. We can be even fancier and trace the scheduling events themselves. I'm not sure how well you can see it from here, but this is the very same perf output for scheduling events, and within a rather short window we already managed to catch two migrations, which is pretty much a lot. Another nice thing you can prove or disprove is huge pages. Somehow huge pages are kind of mysterious in the Linux world; people don't always understand how they work and what they give you. And to make this clear for yourself, you don't have to read some outdated blog post. Unfortunately, that's what really happens: people read some really outdated blog post, draw some conclusions, and then get into trouble. To be precise, we're talking here only about classic huge pages; transparent huge pages are a slightly different beast, and if you want, I'll explain that a bit later. So how can you make it clear how this works? First, you can apply the first source of information: the source code. In this particular case you don't even have to read the Linux kernel source itself; you can just go to the documentation section and you will see: aha, huge pages give us bigger memory pages, 2 megabytes or 1 gigabyte instead of 4 kilobytes, which means translation lookaside buffer (TLB) misses get a little bit cheaper, and there are simply going to be fewer of them, because we have fewer pages to manage. So we've got this information; now we have to prove whether it's true or not for our particular instance of PostgreSQL. Again, it's super simple. In this example I again performed two experiments with the very same PostgreSQL. This particular one was on my laptop, but you can still see the difference, and the instance was using about 8 gigabytes of memory for shared buffers. Then, of course, there was some pgbench against it, and we measured data TLB (dTLB) load misses and store misses.
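Both checks boil down to one-liners with perf stat. A sketch: the event names are perf's generic aliases, the PID is illustrative, and huge_pages is the ordinary postgresql.conf setting.

    # Contention theory: run once per experiment and compare; same
    # cache misses, but about 3x the CPU migrations in the mixed run.
    perf stat -e cache-misses,cpu-migrations -p 12345 -- sleep 60

    # Huge pages theory: compare with huge_pages = off vs on.
    perf stat -e dTLB-load-misses,dTLB-store-misses -p 12345 -- sleep 60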
And here we can see clearly that with huge pages we got about 19% fewer dTLB load misses and about 30% fewer store misses. From this point you can say: aha, indeed, it's true, and our particular application can benefit from it. Then you can go and see how that translates into, for example, latency or something like that. So the idea is to not start with pgbench right away and try to eyeball something. If you're looking for something particular, you can measure exactly that thing and cut away all the noise that could come from pgbench itself. It could happen, for example, that you run some pgbench workload and something else gets in the way, the network or whatever, and you get the wrong results. If you instead measure the particular thing, the particular feature, the particular metric you are interested in, you can draw a sound conclusion. Another nice example, quite relevant for virtual machines and clouds: if you run your particular database instance inside a virtual machine, you get something called lock-holder preemption. I can explain it with this simple diagram. Let's imagine we have a hypervisor, and every hypervisor provides some number of virtual CPUs that are somehow mapped onto real CPUs. Say we have four virtual CPUs, and two of them are active at this particular moment in time, each running some particular backend. C1 is doing some actual work, its backend is doing something, and C2 is, for just a short moment, stuck in a spinlock, waiting for a result from C1. Normally that's not a problem, because spinlocks are quite short; they're going to be released soon, and so on and so forth. But then at some point the hypervisor says: aha, C1 got enough time, now we have to give some time to C4. And what was supposed to be a rather short wait becomes a wait of completely unclear duration on that spinlock, and we're basically wasting CPU resources without realizing it. This problem is so weird that it was addressed even at the CPU level, in hardware: Intel and AMD designed something called Pause-Loop Exiting. It's a feature that triggers a VM exit every time the hardware sees a similar situation, some virtual CPU stuck checking a spinlock over and over. But like every feature, it has its pros and cons for your particular application, for PostgreSQL. And again, you can measure it easily. This experiment is again two runs: PostgreSQL, some pgbench workload, but now inside a KVM virtual machine. In the first run we had just the default configuration for Pause-Loop Exiting, and in the second I turned it off completely. Then we measured with perf how many KVM exit events we got, and we were especially interested in the ones with the reason PAUSE_INSTRUCTION: the ones that say, we caught a PAUSE loop, please stop this virtual CPU and give some time to another one.
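perf can count those exits and break them down by reason. A sketch, meant to run on the hypervisor host rather than inside the guest:

    # Live table of VM exits per reason; PAUSE_INSTRUCTION is the one
    # generated by Pause-Loop Exiting.
    sudo perf kvm stat live

    # Or simply count the raw exit events over a fixed window:
    sudo perf stat -e kvm:kvm_exit -a -- sleep 60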
And we can see that in the first case, with the default configuration, latency reached about 17.7 milliseconds, and in the second, with the feature turned off, about 16.8. So basically 1 millisecond less, which is quite noticeable. It means that in this particular case the feature was hurting performance: the CPU was so saturated that all this waiting on spinlocks was in fact real work, but the VM exit logic couldn't know that, and it kept firing anyway. One millisecond is a rather significant amount of time. Sometimes this feature is good for you, sometimes it's bad, but again, you can measure it easily, which is nice. Okay, but the problem is that everything I showed before is a kind of stateless measurement. strace and perf give you information about what happens at one particular point, one particular moment in time, and sometimes we need something more stateful. So in this section we're going to talk about Berkeley Packet Filter. But first, the reason why it can be convenient, why it can be important. Let's imagine we somehow figured out, from the documentation or wherever, that there is a scheduler setting called wakeup granularity. Roughly, it controls how easily another runnable task can preempt your process; in effect, how often your running process is going to be interrupted. In terms of PostgreSQL: if you're doing some long-running operation, a dump or something like that, and you know it's going to take several minutes, it doesn't really make sense to interrupt it over and over from the scheduler's point of view. So maybe we can adjust this parameter and get better performance on the very same hardware. But unfortunately it's quite complicated to measure. Even if we were able to measure it with perf, for example, we would get so many events that we would not be able to process them, plus every event means a switch between kernel space and user space, which is going to be just terribly slow. And that's why we can utilize Berkeley Packet Filter. A little diagram of how it looks and how it works. BPF has in fact existed in the Linux kernel since the 90s or so; it's not really a new feature, but it used to be limited. Basically, it's bytecode inside your kernel that gets executed every time some event fires. Usually it was used to process something stateless, a TCP packet for example. But then extended BPF introduced really nice, powerful features: now we have registers, we can use a stack, we can store information in maps, and everything happens as bytecode at the kernel level itself. Which means, first, that it's really fast, because we don't have to switch to user space to deliver the information, and second, that we can attach this bytecode to literally every function exposed in your application, in PostgreSQL, or in the kernel, as long as it wasn't optimized away, made static, or something like that. For example, we can literally attach to the virtual file system read function and see what's going on. So let's return to the situation with wakeup granularity and see how this helps us.
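For reference, the knob itself is an ordinary sysctl, and a histogram of uninterrupted on-CPU time is the kind of thing the stock BCC tool cpudist produces. A sketch: the value written below is just an example, not a recommendation, and tool paths vary by distribution.

    # Read and raise the CFS wakeup granularity (in nanoseconds).
    sysctl kernel.sched_wakeup_granularity_ns
    sudo sysctl -w kernel.sched_wakeup_granularity_ns=15000000

    # On-CPU time between interruptions, as a histogram, for one
    # process; cpudist ships with the BCC tool collection.
    sudo /usr/share/bcc/tools/cpudist -p 12345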
So, another experiment. There was a PostgreSQL running on bare metal, there was pgbench, and then a pg_dump, some long-running workload. The time spent doing the pg_dump was about 1 minute 38 seconds, quite a long time. Using the very thing I was just talking about, we gathered a histogram of how much time we spent on the CPU without being interrupted. With the default wakeup granularity, we can see that most of the time we ran only a short interval, between 32 and 63 nanoseconds, without being interrupted. We're spending more than a minute doing a backup; interrupting it that often doesn't really make sense for a backup. So let's address it: let's make the wakeup granularity bigger. We ran another experiment with the very same setup, just a different wakeup granularity, and now we reduced the dump time by about 6 seconds, and we can see clearly that our adjustment of this particular kernel option indeed worked: the distribution changed drastically, and now we spend about 1,000 to 2,000 nanoseconds without being interrupted, doing more useful, more concentrated work in each slice. Of course, you have to understand there is a trade-off, and that's why I was running pgbench in the background: pgbench unfortunately suffered from this quite a lot, something like 5% in latency. It's a balance. And what's important here: yes, we can use extended BPF to get stateful measurements, but it's bytecode after all, so it's kind of complicated to write manually. That's why we can use this really nice, awesome tool called BCC, the BPF Compiler Collection, which allows us to write just a simple Python script to extract the metrics we want; it translates the script into bytecode and then does whatever we asked. It's really magical. It also provides a set of utilities out of the box, for example the kind I was using to measure how much time we spend without being interrupted, but the default set is fairly generic, so I decided to create my own set of tools for PostgreSQL; you can check it out. And it's important, because knowing that what we're looking at is PostgreSQL tells us what to measure. For example, let's imagine for a second that we're running on Kubernetes, and we realize that some particular pods running really tiny databases are in contention for the CPU's last-level cache. It turns out there is Intel RDT, Resource Director Technology, which allows us to slice the last-level cache between different workloads so that they stop contending. But for that, we have to assign a class of service to every particular backend, and we have to understand what kind of cache-miss pattern we have, ideally in really fine detail. And here is a nice example of how we can do that with BPF: we can literally attach to the perf event for last-level cache misses and check this information per backend, per query. It's super mind-blowing.
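The per-backend, per-query breakdown comes from my own PostgreSQL-specific scripts, but the per-process version of the same measurement ships with BCC as llcstat. A sketch:

    # Sample last-level cache references and misses for 20 seconds
    # and print the hit ratio per PID.
    sudo /usr/share/bcc/tools/llcstat 20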
In this particular case we saw that we performed some UPDATE and got about a 50% hit ratio, which probably makes sense, and then we did some SELECT and got a 95% last-level cache hit ratio, plus of course some totals at the end, which is quite nice. And remember the error we got at the very beginning? Now, using this stateful measurement, we can break down what we know about shared memory from PostgreSQL in even more detail. For example, here we can see that we run PostgreSQL and it maps 132 megabytes of shared memory, which is in fact copy-on-write memory: the default configuration of 128 megabytes plus something PostgreSQL decided to allocate for itself. Then we can see that we allocated 56 bytes of anonymous shared memory just to keep a pointer to that segment. And then we can see a further allocation. In this particular case it was somewhat contrived, but it was a non-anonymous shared memory segment, the very same kind I was showing before, the one living under /dev/shm and so on. Okay, another really nice example that touches an important topic. As you probably know, PostgreSQL relies heavily on the Linux file system, and especially on a feature called writeback. So first, let's imagine we're smart, we already know that perf gives us a lot of information, and we want to troubleshoot problems with writeback. What we do is take some PostgreSQL and configure it in such a way that checkpoints essentially never run, and then we start to measure how many pages get written from memory to storage. We can see that normally it almost doesn't happen at all, but sometimes we get spikes, rather huge spikes. Since PostgreSQL isn't checkpointing anything, it means the Linux kernel is doing something. What happens, basically: we run PostgreSQL, there is shared memory, and some pages in that shared memory are dirty, marked red here. First, of course, there is the background writer, which from time to time pushes them out into the operating system cache. But then Linux synchronizes those dirty buffers with storage, according to some particular configuration, and that's what we're trying to figure out. And it becomes even more important when you read the source code and see comments like this one: if you run your PostgreSQL inside a container, inside a cgroup, in this case what the kernel calls a legacy memcg, which is cgroup v1, which is what essentially everyone runs (so it's all "legacy"), the comment says the accounting is completely broken. Which is in fact correct: unfortunately, in version one there is a quite complicated connection between the block I/O controllers and the memory controllers. Taking this into account, it's really important to get at this information: maybe the I/O you see on your server is not real workload I/O; maybe it's I/O caused by this writeback behavior. Okay, so we measured this performance metric, writeback_written, and we can see that, indeed, we sometimes get periodic writeback and sometimes something called background writeback, when Linux basically tries to flush all the dirty buffers to storage, which saturates the storage I/O, which means Postgres has quite a hard time writing anything.
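The kernel exposes these as ordinary tracepoints, so perf can count them without any BPF at all. A sketch:

    # List the kernel's writeback tracepoints, then count the metric
    # used in this experiment over a one-minute window.
    sudo perf list 'writeback:*'
    sudo perf stat -e writeback:writeback_written -a -- sleep 60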
Actually, on modern storage like NVMe, where you have a number of channels and deep queues, it's usually not a problem you care about, because you still have some capacity to write. But it can get really bad. The problem is that when the Linux kernel is not able to keep up with writeback, when you're producing dirty pages faster than the storage can absorb, it starts to inject I/O timeouts. Literally: you run some query, you want to write something to the write-ahead log, and you have to wait, because the kernel told you to. This information is kind of scary and you have to keep an eye on it, but unfortunately it's rather complicated to get, and that's why we have this nice script called io_timeout, which uses the BPF feature again to fetch it. This particular example is Postgres running, again with some amount of shared buffers, on a laptop; that's why you see only four of these timeouts, which is not that much, but you can imagine that on a huge server with 256 gigabytes of memory it can be much, much more. So it's something really scary sometimes. Or another interesting use case: let's imagine that after some huge party you're walking around with a terrible hangover and you realize you run your database on Kubernetes. It can happen from time to time. So, if you want to configure your database in terms of resources there, there are two sections you can set: requests and limits, a sort of soft limit and hard limit for resources, and memory appears in both sections. The thing is, Kubernetes, as you know, is an orchestration mechanism; it runs all those containers, and inside the cgroups we also have soft limits and hard limits for memory. My original idea was that most likely they convert one-to-one; they should, right? Of course they don't. Nothing is straightforward with Kubernetes, so you have to expect something like that. And here was my theory: most likely this is good for us, because it means we don't have the page-reclaim problem. If you have a soft limit and you go above it, the Linux kernel tries to reclaim some memory to confine your memory consumption. Of course this particular theory was not correct, and it could be really easily disproved by measuring that particular thing: you can use this script called page_reclaim, attach it to our database, and see that we are still reclaiming memory. Which means that every time you write something, every time you dirty some buffer in shared memory, the kernel has to reclaim memory from the container, which is of course an overhead. And here is another reason why it's sometimes super important to use BPF: normally, when you monitor some global resource, like I/O or, in this particular case, page reclaim, it can be rather complicated to check how much of the numbers you get really relates to this particular container or to something else. With BPF we can in fact figure this out easily, because we can just filter by that particular container.
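By the way, you can see for yourself how the pod's requests and limits landed in the cgroup. A sketch for cgroup v1; the mount paths vary by runtime.

    # Inside the container: the hard limit derived from 'limits'...
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    # ...and the soft limit, which does not simply mirror 'requests'.
    cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes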
Yeah, and the last section, the last two sections, in fact: how to actually run all this, because it turns out it's not that straightforward, unfortunately. On your local machine this approach, this extended BPF, is rather easy. Basically you need to make sure you have a properly configured Linux kernel with the feature enabled, and the nice fact is that almost all modern distributions enable it by default, so you barely have to do anything; and you have to mount debugfs and, what's also important, be able to read from and write to it. So locally it's super easy: you install BCC, you install the scripts, and you start to play with your local instance. That's fine. But now you want to really apply this information in your infrastructure, and it gets a little bit more interesting. First of all, if you run something inside containers and you want, for example, to sample it with perf, you get an interesting situation: the debug symbols are in one place and perf is trying to find them in another. You first have to bring them together; on Ubuntu, for example, you can just copy the /usr/lib/debug directory across. What's more important is that if you want to be able to read and write debugfs, you unfortunately need some privileges. I'm not entirely sure, I still have to check the documentation, but so far I haven't seen a capability defined by Docker that controls whether you can do this, so usually the easiest way is to run your container in privileged mode. It's basically a sort of superuser mode, where you can read all the sysfs and whatever debugfs you have on the host. And as a nice side benefit, you can also attach to the particular container you want to troubleshoot: if you run your database in a container, you don't want to pollute that container with extra tooling, so you can put a separate privileged container next to it and troubleshoot it nicely from there. Okay, and with Kubernetes it in fact gets even more complicated; it took me about a few months to figure out how to do this within our infrastructure. First of all, again, you have to run privileged containers, and on Kubernetes that's a little more tricky, because most of the time this is disabled: you can't just say "run me a privileged container", the feature is off by default. That's why I had to use a privileged service account, where you say, I want to attach to this particular host machine and I want to run this particular container with these particular privileges, and then you're able to run all those scripts. But it turns out that's still not enough. It's quite a frequent situation that your pod runs on one image while Kubernetes itself runs its services on CoreOS or whatever, which means your kernel versions don't really match each other. They can differ by really small numbers, but they're still different, and unfortunately in this situation BPF doesn't work, because it requires the client and the kernel to match exactly. Fortunately, and I found this hidden in the source code rather than in the library documentation, you can override this, if you're sure your scripts are going to work fine with that particular kernel version: you take the version number, 4, then the second number, 14, then 96, multiply the components by the right factors, and you get a code that you provide as an environment variable to override this behavior in BCC.
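Putting the pieces together, a sketch: the image and container names are illustrative, and the environment variable name is what I found in the BCC source, so double-check it against your BCC version.

    # A privileged debug container alongside the database container,
    # sharing its PID namespace, with the host's debugfs visible:
    docker run -it --privileged \
        --pid=container:my-postgres \
        -v /sys/kernel/debug:/sys/kernel/debug \
        my-debug-image

    # The kernel version override: LINUX_VERSION_CODE is
    # (major << 16) + (minor << 8) + patch, so for 4.14.96:
    export BCC_LINUX_VERSION_CODE=$(( (4 << 16) + (14 << 8) + 96 ))  # 265824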
Okay. But the thing is, everything I've told you before is really cool, it allows you to do crazy shit, and that's why I keep saying it's really important to stay curious and try to figure this stuff out; but you have to do it really, really carefully. From my own experience: in this example I was trying to use perf to check how frequently one particular trigger function was being called in PostgreSQL, and I also wanted to extract some data; as far as I remember, just the name of the trigger. For that you create a uprobe inside the PostgreSQL binary, blah blah blah, and extract the value. And here's the really important part: you can access information through pointers, and that is terribly scary, because there could be, for example, a NULL pointer there, and you're just going to crash your entire backend, just like that. Basically it means your transaction is going to be, well, not corrupted, but aborted, and so on and so forth, and the client is going to be disappointed. Another nice one, which basically reminds you that all the software we use, even software as low-level as the Linux kernel, is unfortunately not bug-free. In this example, again from my own experience, I was trying to figure out how frequently one particular database was writing to the write-ahead log. For that I was creating a probe on XLogInsertRecord, with the LSN and so on, to figure out how frequently it runs, plus I wanted to know how frequently we were doing full-page writes. In fact, after all this research, we got some genuinely interesting information: we found out that this particular database was doing this extensive write-ahead log I/O absolutely unnecessarily, and we figured that out just by profiling. But in this case it got really scary, because unfortunately there was a bug in the Linux kernel, in this particular perf implementation (perf is also part of the Linux kernel), and this bug put perf into uninterruptible sleep inside kernel space, waiting for some event in debugfs that never arrived. What does that mean for you? It means you've got a process that you cannot kill, and all the other processes end up waiting for it, which means you cannot restart the Docker container, you cannot do anything. In that particular case we had to re-initialize the replica to get out of it. And the last example is just a screenshot from the repository where BCC lives, from April 2018, as far as I remember, where it turned out that a relatively old Ubuntu, with Linux kernel version 4.4-something, would unfortunately, from time to time, hit a kernel panic. I mean, imagine: you run a Python script and you get a kernel panic. Absolutely mind-blowing. But sometimes it happens, because this works with really, really low-level stuff. Yeah, and at this point, I guess that's it. I hope you have a really large number of questions. Thank you. So, any questions? Seriously, come on. Yeah, just a second, there was a mic here. The question was: did you say there was a safer way to access potentially NULL pointers with BPF? Well, kind of. The thing is
that BPF itself, since it's bytecode you run inside the kernel, is kind of a scary thing, and that's why the Linux kernel already performs a lot of checks on it, including NULL-pointer checks. Which means that if you're using the BCC library, for example, every time you want to read something like that from memory, you have to call a special helper function (bpf_probe_read) that will check all this for you. It means you're going to write slightly more complicated code, but it will be checked for you. Yep, one more question. The question was: what have you found to be the most difficult thing to work through, the most complicated issue you've run into, in particular on Kubernetes? Well, on Kubernetes itself it's complicated stuff, because some things, for example this shared memory thingy, are kind of obvious, because you can find them out even with strace; but at the same time it's still a problem, because to fix it you have to do a lot of work. In general I'm kind of in the middle of this research, because it's really interesting: Kubernetes itself is reasonably well documented by now, but there are so many cases, so much interesting stuff, that I'll probably create a separate talk about it. But the most interesting, most complicated use case I've stumbled upon in my experience (and in fact I started quite recently, just one year ago) was this thingy with XLogInsert, where we saw that one particular database was really thrashing with I/O,
and sometimes it even prevented archiving from doing its job, and so on. It turned out to be a really interesting case: a Postgres cluster with data checksums enabled and with wal_log_hints enabled, under a particular workload where you end up writing to the write-ahead log super frequently, and sometimes unnecessarily, just because of hint bits. It was really complicated; it took me a few weeks to figure out what was going on inside. Okay, if there are no other questions... yes, please, one more. The question was about the case with the long and short transaction workloads: I ran through that really quickly, and the questioner wanted to clarify their understanding of processor affinity, since the scheduler supposedly wouldn't preempt a process onto another CPU, because it tries to keep affinity between the process and the CPU it's running on. That's a good question. The thing is, there is a threshold, and you can configure this threshold; for particular workloads the default may not be enough, and that's actually why people advise tuning it. Sometimes the default threshold is not enough, and then you get this switching back and forth. This was the default Completely Fair Scheduler. And one final question: are there things Postgres could do, or that could be changed in Postgres, to make some of these things easier to debug, easier to monitor, or even easier to fix? Well, that's a rather simple and complicated question at the same time. We already have some number of static tracepoints within PostgreSQL itself, just as in the kernel, and of course it would be nice to have more of them; but that mostly helps perf. In BCC right now we have an interesting situation where you cannot really use these probes, these dynamic tracepoints, for a whole instance: attaching requires one particular process ID, so if you want to trace the entire instance rather than one backend, you just cannot do it. So it's a two-sided thing: ideally you'd add more tracepoints in PostgreSQL, but you'd also have to fix this problem in BCC itself. There is in fact a discussion about this right now in the BCC community, but as far as I understood it will require some major rewriting, so maybe it's a long way off, unfortunately. Okay, if we're out of questions, then thank you; and let me remind you: stay curious, and don't forget that all this hacking is all about fun. Thank you.