OK, hi everyone! My name is Alexey, and today I will talk about Postgres activity statistics: what problems we can solve with them. I will show query examples, what they mean, and what to do about them. There is also a link to the slides, so you can download them and try the queries on your own installation. And the things I will not talk about: I will not talk about query planner statistics, since Postgres activity statistics and query planner statistics are different topics. Also, I will not talk about utilities and tools that work with stats; I will use only psql and plain SQL queries.

I want you to walk out of this talk with a few take-home messages. I want to convince you that the Postgres activity stats are useful and solve many problems. By default, Postgres stats are provided in plain text form and may seem inconvenient to use, but this isn't true. I will start from simple query examples, and you will see that the stats aren't scary. Examining these query examples, I will show which stats should be used in particular cases.

So, let's start. What is Postgres from the system administrator's point of view? If we use system activity tools such as top or similar, we will see some processes and can estimate how many resources they consume: CPU, memory, swap, disk and network utilization, but nothing more. If we want to understand what is going on inside Postgres, we should use special tools and views, and these tools provide an interface to the Postgres stats.

Postgres is a complex piece of software that is comprised of many subsystems. Let's have a look at this picture and try to answer the question: where does Postgres spend its time? And here we have two answers: the first is servicing client queries, and the second is background operations. Let's review both cases, beginning from the top. Imagine that we have a client application that works with the database. For any client, Postgres creates a dedicated process which we call a backend. The application sends queries to the backend, which parses, plans and executes them. Next, running a query requires reading or writing some data in tables and indexes; this is done using the shared buffers. If it's a simple SELECT query, data is read from disk into the shared buffers and the result is returned to the client. If it's a data-modifying query such as UPDATE, INSERT or DELETE, the changes must be committed, and all data changes are written to the write-ahead log. In this way Postgres works with clients. Next, we have background operations that are invisible to the clients. These are query logging, flushing dirty data to disks, autovacuum, and processing the write-ahead log for logical decoding or sending it to streaming standbys. As we can see, Postgres isn't a simple thing, and if a problem comes up in some part of Postgres, it may influence other parts. And here the Postgres stats will help to track subsystem activity, estimate its effectiveness and suggest some optimizations.

But there are a few problems with the stats. The first problem: there are too many stats in Postgres. In the current release there are over 100 items, and the number of items directly depends on the number of databases, tables and indexes in the whole instance. The second problem is that the stats are provided as permanently incremented counters: you can see billion-scale values in the stats, and these values say nothing to you. It's very hard to track changes in very large values. The third problem is that there is no history for the stats: there is no simple way to see the stats as of 5, 15 or 30 minutes ago. You can use third-party tools which solve this issue, but Postgres itself doesn't have tools for that. The next problem is the absence of built-in tools for working with stats: you can use only SQL, psql and self-written queries. Yes, it may be flexible, but it's also too hard for beginners. And this problem entails the next one: a lot of monitoring tools for Postgres are available, and they have different sets of features. Often you need to use two or three tools, since one tool has features that another tool doesn't.

Well, a simple way to not suffer from these problems is to know what is provided by the stats and how to use them as-is. It's important to be able to use the stats directly from Postgres without any special tools; it requires basic SQL skills for writing queries. The stats are based on counters in shared memory: every counter is associated with a specific event and increments when that event occurs. Special functions are used for extracting values from these counters. There are many such functions, and to make their usage easier, the functions are organized into views; by default there are about 20 different views, and you can create your own views with your own set of functions. In addition to the built-in functions, we can use stats from additional extensions. There are official and unofficial extensions: the official extensions are collected in the contrib package and can be installed from the official Postgres repo; unofficial extensions are not supported by the Postgres team, are shipped without any warranties, and should be used at your own risk.

On this slide we have the stats functions and views that provide various stats for database objects. To not get lost amidst these names, we can use the previous diagram: there you can see which stats functions or views are associated with which Postgres subsystems. Now I will talk about the most interesting views and explain how to use them and what help they can provide to us.

First of all, there is pg_stat_database, which shows information for all databases in the current Postgres instance. There is summary information about the number of tuples written in each database, the number of commits and rollbacks, and information about temp files, recovery conflicts, deadlocks and others. So how to use this information? Using it we can estimate the cache hit ratio. The cache hit ratio is an effectiveness metric for shared memory, which is known as shared buffers. It's obvious that reading from memory is faster than reading from disks, and with the cache hit ratio we can get an idea of whether we have enough shared buffers. Using the blocks hit and blocks read values, we can determine the cache hit ratio in percent. Ideally this value should be close to 100%; values below 90% say that the amount of shared buffers isn't enough for the current workload, and Postgres is forced to read data from disks and evict hot data from the cache. Note that we can also see a low cache hit ratio immediately after a Postgres restart.

The next case is tracking anomalies such as rollbacks, recovery conflicts, deadlocks and temp files. By anomaly I mean an unwanted event happening in the database, not bugs. As with the cache hit ratio, we can calculate the commit ratio using the transaction commit and transaction rollback values. If the number of rollbacks is too high, you should direct your attention there and investigate why so many operations end with rollbacks. In any case, it's bad behavior when an application does something, changes data or spends time, and in the end rolls back its results. And on this slide I put caution values; if you see such values, you should be worried.
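A minimal sketch of both ratio calculations, assuming only the standard pg_stat_database columns, could look like this:

    -- cache hit ratio and commit ratio per database, in percent;
    -- nullif() protects against division by zero on idle databases
    SELECT datname,
           round(100 * blks_hit / nullif(blks_hit + blks_read, 0)::numeric, 2) AS cache_hit_pct,
           round(100 * xact_commit / nullif(xact_commit + xact_rollback, 0)::numeric, 2) AS commit_pct
    FROM pg_stat_database
    ORDER BY datname;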
It's also important to track recovery conflicts and deadlocks, find their causes and eliminate them. Recovery conflicts are related to streaming replication, and they mean that queries running on standbys have been canceled. Detailed information about conflicts can be found in the server logs or in the pg_stat_database_conflicts view. Deadlocks are unpleasant events too, and detailed information about them can also be found in the server logs. For detailed lock tracking you should enable the log_lock_waits option in postgresql.conf; it helps to get a clearer picture of when deadlocks occur.

And a little bit about temp files. Temp files are always written to disks, and queries that use temp files become slower. You should track queries that create huge temp files and optimize them if possible. If the log_temp_files option is enabled, you can find detailed information in the server logs, or use the pg_stat_statements extension; about the latter I will talk in a few minutes.
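pg_stat_database also carries per-database temp file counters, so a quick check, as a sketch, might be:

    -- databases that create many temporary files
    -- (counters accumulate since the last stats reset)
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_size
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;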
The next important view is pg_stat_bgwriter. With this view we can estimate the effectiveness of the checkpointer and the background writer. These are background processes that perform very important tasks. The first task is checkpointing. This is a regular operation related to data synchronization between shared buffers and data files, and to creating checkpoints, special records in the write-ahead log. The second task is the permanent writing of dirty data to disks, so that the backends handling user queries seldom or never need to wait while writing takes place. Moreover, the bgwriter helps the checkpointer process by reducing the amount of work for checkpoints. For controlling the efficiency of these processes you can use the pg_stat_bgwriter view; there you find information about executed checkpoints, their timings, and the buffers written.

Let's start from checkpoints. As I mentioned earlier, checkpointing is the process of synchronizing all dirty data from shared buffers to the data files on disks and creating special records in the write-ahead log when the sync is done. There are two types of checkpoints: requested checkpoints and checkpoints by timeout. Checkpoints by timeout occur at specified time intervals, and requested checkpoints take place when Postgres writes a specific amount of write-ahead log. Checkpoints by timeout are to be preferred over requested checkpoints: requested checkpoints happen when the database is under heavy write load, that is, when a huge amount of data should be written to disks. When a checkpoint occurs, it creates additional load and can cause performance to drop severely, and pg_stat_bgwriter helps to determine how many checkpoints are being executed and of what type. If the requested checkpoints value is too large, we can increase the amount of write-ahead log that is required to trigger a checkpoint. postgresql.conf has several options for controlling checkpoints: min_wal_size, checkpoint_timeout, checkpoint_completion_target and checkpoint_segments. The last option is actual for Postgres versions before 9.5: in the 9.5 release checkpoint_segments was replaced with min_wal_size and max_wal_size. Thus, by increasing checkpoint_timeout or max_wal_size, we spread checkpoint execution in time and reduce the load that is created by checkpoints.
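A sketch of such a check, assuming the default 8 kB block size when converting buffers to bytes:

    -- how checkpoints are triggered: by timeout (preferred) or requested under write load
    SELECT checkpoints_timed,
           checkpoints_req,
           round(100 * checkpoints_req / nullif(checkpoints_timed + checkpoints_req, 0)::numeric, 2) AS req_pct,
           pg_size_pretty(buffers_checkpoint * 8192) AS written_by_checkpoints
    FROM pg_stat_bgwriter;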
Next, with these stats we can estimate the bgwriter effectiveness. The main goal of the bgwriter is to write out dirty pages from shared buffers to disks. It writes pages in batches, and the size of a batch isn't constant: if the batch is big and too many pages have been written, the bgwriter is forced to pause, and this is required to reduce the impact on I/O. So we can use the buffers_clean and maxwritten_clean values to tune the amount of work for the bgwriter. If the storage system is fast, we can add extra work for the bgwriter by increasing the bgwriter_lru_maxpages and bgwriter_lru_multiplier parameters, or we can reduce these values and increase bgwriter_delay when we want to reduce the impact on performance from the bgwriter.

Also, special attention should be paid to buffers_backend_fsync. This item shows how many fsync calls were made by the backends. As you know, fsync is a heavy call, and if each backend does fsyncs itself, this will negatively influence performance. To this end, the backends try to delegate all fsyncs to the checkpointer process: it has an fsync queue, and its size depends on the size of shared buffers. In the worst case, when the fsync queue is full, the backends are forced to do fsyncs by themselves. It's a very unpleasant situation, and it means that there is a serious problem with storage. So if you see values greater than zero, it means your storage is overutilized.

The next view is pg_stat_replication, which shows information about the connected streaming replication standbys. Here you can find information about the remote host and its parameters, and a group of WAL location values. The locations are a very important piece of information that helps to check for problems with standbys. The main goal of this view is replication lag tracking. Replication lag is the situation when a standby server is behind its upstream and has no time to replay all the write-ahead log received from the upstream. Every location describes a specific position in the write-ahead log, and here we have four positions: sent from the upstream to the remote host, and written, flushed and applied on the remote host. Thus, by calculating the differences between the positions, we can determine the bottleneck; the ideal situation is when all positions are equal. Replication lag can be caused by several problems, such as an overloaded network, storage or CPU. The network becomes a problem when the upstream needs to send too much data to a standby over a network with low throughput. Overutilized storage means that Postgres has no time to write, flush or replay the received data. And the third, very rare, case is when the CPU is the bottleneck. The problem here is that the recovery process that replays the log is single-threaded, and in rare cases it can utilize a core almost to 100 percent, for example when the host has a very fast storage subsystem.

Below are query examples. The first one tracks the write-ahead log position in a human-readable decimal format; if we measure these values at some interval of time and compare them, we can get an idea of how much write-ahead log is generated by our server. The next example is replication lag tracking in bytes and in seconds. Here we use the additional functions pg_current_xlog_location, pg_last_xact_replay_timestamp and pg_xlog_location_diff. And a few words about the lag in seconds: this query can run only on the standby server and has one drawback. If your upstream server is idle, the query shows a growing lag, and in fact that's wrong. Here is another example of a query which I use very often: it calculates the difference between each position, and the total lag at the end.
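A minimal sketch of such a lag query, using the 9.x-era names (in Postgres 10 and later these became pg_current_wal_lsn, pg_wal_lsn_diff and the *_lsn columns):

    -- per-standby lag in bytes at each stage, run on the master
    SELECT client_addr, application_name,
           pg_xlog_location_diff(pg_current_xlog_location(), sent_location)   AS pending,
           pg_xlog_location_diff(sent_location,  write_location)              AS write_lag,
           pg_xlog_location_diff(write_location, flush_location)              AS flush_lag,
           pg_xlog_location_diff(flush_location, replay_location)             AS replay_lag,
           pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS total_lag
    FROM pg_stat_replication;

    -- lag in seconds, run on the standby; misleading when the upstream is idle
    SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;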
The next view is pg_stat_all_tables. It shows stats for all tables in the current database; this view includes stats about user and system tables, and each database has its own view with its own tables. With this view we can get a picture of table usage. There is info about table access: sequential and index scans, how many tuples were returned by these scans, which operations are run against the tables, and how many tuples are affected. Also, the information about live and dead tuples helps to understand when autovacuum should start for a table, and of course there are autovacuum and analyze counters with the time of the last execution. There is so much information that it doesn't fit on the slide at all.

The simplest thing that we can see here are the sequentially scanned tables. A sequential scan takes a table and reads it from the start up to the position where the query condition is satisfied; with big tables it can take a long time. It's important to see how many tuples were returned by one sequential scan. For example, in an OLTP workload, if one scan returns more than one or ten thousand rows, you should check the query, EXPLAIN it, rewrite it, or probably create a new index.

Also, with this view we can estimate table sizes. Of course we can use psql meta-commands, but they are not as flexible as your own queries, where we can use ordering, grouping, filtering and various conditions: for example, show the largest tables, or tables with indexes, or tables with large indexes. Pay attention: here we use the additional functions pg_size_pretty and pg_total_relation_size.
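A sketch combining both ideas, sequential scan volume and table size:

    -- tables where sequential scans return many tuples, together with their sizes
    SELECT schemaname, relname,
           seq_scan,
           seq_tup_read,
           seq_tup_read / nullif(seq_scan, 0) AS tup_per_scan,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_all_tables
    WHERE seq_scan > 0
    ORDER BY seq_tup_read DESC
    LIMIT 20;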
The next place where these stats can be useful is checking tables for write activity. Write operations are heavy operations, and if it's possible to make them more lightweight, it should be done. An UPDATE operation in fact consists of two operations, delete and insert: when a tuple is updated, the old tuple is marked as removed and a new version of the tuple is inserted. Additionally, the index entries related to this tuple also have to be updated. Thus tuple updating is a demanding operation, and for reducing the write overhead, HOT (heap-only tuple) updates were introduced in Postgres 8.3. They avoid index updates in the case when non-indexed values are updated. For viewing the stats, two fields are available, n_tup_upd and n_tup_hot_upd; using them we can determine the HOT updates ratio (a small sketch follows at the end of this section).

So we need to answer the question: how to increase the number of HOT updates? For that we can play with the fillfactor option. What is fillfactor? It is a page packing ratio, and it can vary from 10 to 100 percent. It works in the following way: when new tuples are inserted into the table and a new page is allocated, this page is filled with new tuples according to the fillfactor. If the fillfactor is below 100%, a new page will have free space that can be used later for update operations: a new version of a tuple can be written into this free space, and in this way HOT updates will occur more frequently, which is a positive fact for performance. But there is a drawback to fillfactor: the lower the fillfactor you use, the more space the tables will take.

Another thing which can be useful is viewing the list of tables that should be vacuumed. Unfortunately, Postgres doesn't offer a simple way to show this information, so we will use a slightly more complex way. To do this you need an additional function, current_setting, which shows the specified Postgres setting. But how does Postgres know that a table should be vacuumed? When an autovacuum worker starts, it builds a list of tables that should be vacuumed, analyzed, or both. It will process the tables that have a number of dead tuples larger than a threshold; the threshold is based on the number of tuples in the table and the following settings: autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold. The tables to analyze are determined in a similar way, but instead of dead tuples, the number of tuples changed since the last analyze is used. Thus we can write a query that shows this information, and if you count the rows in this list, you get something like an autovacuum queue size. But note that these parameters can be overridden for particular tables with the ALTER TABLE command, and this query doesn't account for that. Our colleagues from Avito implemented a monitoring plugin that accounts for these per-table settings; the query used there can be found at this URL. Fortunately, in Postgres 9.6 the stats collector will have a dedicated view, pg_stat_progress_vacuum, that allows tracking vacuum activity and progress.
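A simplified sketch of such a vacuum-queue query, using only the global settings and approximating the table's tuple count with n_live_tup + n_dead_tup:

    -- tables whose dead tuples exceed the global autovacuum threshold;
    -- per-table storage parameters are ignored here, as noted above
    SELECT schemaname, relname, n_dead_tup
    FROM pg_stat_all_tables
    WHERE n_dead_tup > current_setting('autovacuum_vacuum_threshold')::integer
          + current_setting('autovacuum_vacuum_scale_factor')::numeric
            * (n_live_tup + n_dead_tup)
    ORDER BY n_dead_tup DESC;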
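And the promised sketch for the HOT update ratio; the ALTER TABLE line shows the fillfactor knob on a hypothetical table named test:

    -- share of updates that were HOT; a low ratio on update-heavy tables
    -- is a hint to try lowering fillfactor
    SELECT schemaname, relname,
           n_tup_upd,
           n_tup_hot_upd,
           round(100 * n_tup_hot_upd / nullif(n_tup_upd, 0)::numeric, 2) AS hot_pct
    FROM pg_stat_all_tables
    WHERE n_tup_upd > 0
    ORDER BY n_tup_upd DESC
    LIMIT 20;

    -- hypothetical example of changing the packing ratio
    ALTER TABLE test SET (fillfactor = 70);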
The next view is pg_stat_all_indexes. This view is similar to the previous one and describes index stats; it helps to estimate how actively the indexes are used. Here we see the index names and how many tuples are returned. A practical usage of this view is searching for rarely used or unused indexes. Unused indexes in a database are very bad: they take space and amplify write operations, because write operations concern not only the tables but the indexes too; also, the indexes are vacuumed during table autovacuum. Basically, we need to check the database for unused indexes and remove them. For that we should find all the unused indexes, those with zero index scans; the found indexes should be noted and removed. But be careful here: indexes that are not used on the master can be actively used on the standbys, so be sure that the indexes are not used on all hosts. I also recommend paying attention to the following URLs; there you can find useful and extended examples for unused index searching (a minimal sketch also follows at the end of this section).

The next and very useful view is pg_stat_activity. It contains information about the connected clients and the queries that run at this moment. Each line in this view corresponds to a Postgres backend, and in it we have various information, such as the backend pid, remote host information, the start times of the backend, transaction and query, the backend state, and the current or last executed query. The simplest thing we can get from this view is information about the overall database activity, for example the number of used connections. It can be useful to track when the number of connections is close to the limit; with such a query we can see how many connections, in percent, are open to the database. Also, this view can be used for tracking how many and which users are connected to the databases. It's useful when we need to know who hijacked all the connections; in this example, the cron role is the bad boy.

Another way to use this view is checking for long transactions and queries. Here we need the following columns: backend_start, xact_start, query_start and state_change. For calculating the age of a query or transaction we need to use an additional function, clock_timestamp, which shows the current time; calculating the difference between these values, we can get the age of a transaction or of a running query. And what should we do with queries or transactions that take too long? Note them, terminate them, and optimize the queries or the applications that execute them. In an OLTP workload, long transactions are unwanted and you should avoid them.
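A sketch of this age calculation:

    -- longest-running transactions and their current queries
    SELECT pid, usename, state,
           clock_timestamp() - xact_start  AS xact_age,
           clock_timestamp() - query_start AS query_age,
           query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC
    LIMIT 10;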
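And the sketch for unused index searching mentioned above; pg_stat_user_indexes is the same stats restricted to user tables:

    -- candidate unused indexes: never scanned since the last stats reset;
    -- unique indexes are excluded since they enforce constraints, and remember
    -- to check the standbys before dropping anything
    SELECT s.schemaname, s.relname, s.indexrelname,
           pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
    FROM pg_stat_user_indexes s
    JOIN pg_index i ON i.indexrelid = s.indexrelid
    WHERE s.idx_scan = 0
      AND NOT i.indisunique
    ORDER BY pg_relation_size(s.indexrelid) DESC;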
Especially problematic are transactions that are opened and then get stalled in an idle state. This can happen in a situation when an application opens a transaction with the BEGIN command, does something inside it, and leaves the transaction without a COMMIT or ROLLBACK. It's an unwanted situation, because in this case the tables and the indexes bloat, and this may affect performance negatively. For searching for such bad transactions we need to use the state column: check for transactions that are in the 'idle in transaction' or aborted state. If you see many idle transactions with an age of dozens of minutes or more, you should terminate them as soon as possible; next, your application should be optimized to do the transaction processing in a proper way.

The next thing that can be tracked with pg_stat_activity are the waiting clients. Pay attention to the waiting column: in a normal situation all the values here should be false. If there is a true value, it is bad: it means that the backend is blocked by another transaction or query and is waiting for the blocking transaction to end. While it's waiting, the backend does nothing useful, and from the user's point of view the database seems to hang. And what to do in these situations? The waiting transactions or statements should be canceled or terminated, and the application code that runs the blocked statements should be optimized. To see the details you should enable the log_lock_waits option; when it's on, additional info will be written to the PostgreSQL logs. Also, the pg_locks view can be used to perform an immediate investigation when a lock happens.

And to continue with blocking queries, a few words about the pg_locks view. This view shows information about the current locks. Using the queries from these URLs, you can understand which queries block one another (a small sketch also follows at the end of this section); when the blocking statements are found, you can cancel them, and the blocked statements can continue their work. In this example there is a blocked UPDATE operation, and it waits for an ALTER TABLE; but the ALTER TABLE went into the 'idle in transaction' state, thus canceling the ALTER TABLE will solve the problem and the UPDATE can continue.

Yet another set of useful stats is provided by the pg_stat_statements module, which is included in the official contrib package. Before using pg_stat_statements you should load its library and create the extension in a database. When all the conditions are satisfied and the extension is created, the stats are gathered permanently and can be viewed in the pg_stat_statements view. This view contains one row for each distinct query, user ID and database ID. Also, there are other useful stats related to these statements: the number of calls, the total rows that were returned, timings and resource consumption. There are cache stats, how many blocks are read from shared buffers or disks, and how many buffers are written or dirtied; stats about I/O related to local memory and temporary files; and finally, the timings for I/O operations.

There are many ways to use these stats. For example, we can calculate the average query time for a Postgres instance. It's a common database health metric which can be used to track the database's response time: when problems occur, this average time will go up. Also, we can analyze the I/O stats, for example to find the statements that write data to the shared buffers more frequently than others, or the statements that create too much temporary data or use a lot of local memory. Another way to use pg_stat_statements is to make query reports. The module has a function for resetting the stats, so it is possible to gather stats for a period of time and turn them into reports. This way of query reporting is more convenient than log parsing: it's fast and doesn't consume much CPU or I/O resources. A common approach in this case is to calculate the summary stats for all statements, then calculate the stats for particular statements and determine their ratio in the summary stats. In this example, the SELECT statement has a total execution time of over 15 hours, which constitutes 14% of the summary time. Using the total_time, blk_read_time and blk_write_time columns, we can calculate the CPU and block I/O time usage. Thus we can make a report with the top problem queries and optimize them.
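A sketch of both report queries; total_time is in milliseconds here (it was renamed to total_exec_time in Postgres 13):

    -- average statement time for the whole instance
    SELECT sum(total_time) / sum(calls) AS avg_time_ms
    FROM pg_stat_statements;

    -- top statements by total execution time and their share of the summary time
    SELECT left(query, 60) AS query_head,
           calls,
           round(total_time::numeric, 0) AS total_ms,
           round((100 * total_time / sum(total_time) OVER ())::numeric, 2) AS time_pct
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;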
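And the promised sketch for pg_locks: a simplified pairing of waiting lockers with their blockers. The full queries at the URLs match more lock fields, and since Postgres 9.6 the pg_blocking_pids function makes this much simpler:

    -- who blocks whom, simplified: match ungranted locks to granted ones
    SELECT blocked.pid        AS blocked_pid,
           blocked_act.query  AS blocked_query,
           blocking.pid       AS blocking_pid,
           blocking_act.query AS blocking_query
    FROM pg_locks blocked
    JOIN pg_stat_activity blocked_act ON blocked_act.pid = blocked.pid
    JOIN pg_locks blocking
      ON blocking.locktype = blocked.locktype
     AND blocking.database      IS NOT DISTINCT FROM blocked.database
     AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
     AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
     AND blocking.pid <> blocked.pid
    JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
    WHERE NOT blocked.granted
      AND blocking.granted;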
Well, there are a lot of stats, and unfortunately I have no time to describe all of them, so I will just briefly say a few words about the other useful stats. Besides the stats that I have mentioned so far, there are stats for table and index buffer I/O: pg_statio_all_tables and pg_statio_all_indexes. Also, there is a view which describes function usage stats, pg_stat_user_functions; there you find the number of calls and the execution times. There are also many functions for counting the sizes of database objects; all these functions can be listed with the \df meta-command in psql.

The official contrib package has not only pg_stat_statements but also two other interesting modules. The first is pgstattuple, which allows to estimate the bloat in indexes and tables; but be aware that the estimation time directly depends on the table or index size and consumes extra I/O. The second module is pg_buffercache, which allows to inspect the shared buffers: it shows how many free or dirty buffers there are, and their usage. But using the pg_buffercache functions can cause a performance impact, which is caused by locking in the shared buffers.

Besides the official contrib package, there are many third-party modules, for example pgfincore or pg_stat_kcache. They are unofficial modules and they are not in the contrib package. pgfincore includes functions that help to perform a low-level inspection of the operating system's page cache and determine which tables and indexes are in the page cache. Another module, pg_stat_kcache, allows to determine the CPU and disk utilization by SQL statements. This is possible using the getrusage system call, which is invoked before and after executing the statements. This module doesn't have a serious performance overhead, but to use it you need Postgres 9.4 and pg_stat_statements enabled in your database.

Well, that is it, and I want to sum up a few take-home messages. First, as you saw, the statistics can be very useful: they allow you to keep an eye on your database and to find and prevent different kinds of problems. Second, the stats in Postgres aren't difficult or complicated. Third, the stats can help you to answer questions related to performance issues or problems in the database. And the last and important one: try to experiment with the stats, write new queries, change them, optimize them, and try to construct the ideal queries for your Postgres. Useful information can be found at these URLs, which I highly recommend. And that is all, thanks for your attention; if you have any questions, feel free to ask them, and my colleague Ilya will help me if I have problems with communication. That is it.

Question: Maybe speak louder? You had it in red when that number was one thousand?

Answer: Yes, the red mark at one thousand, that's right. I mark the tuples in red if a scan returns more than a thousand, because that is bad. In an OLTP workload we can see the queries that return tuples, and if a query returns too many tuples, it consumes more time: the more tuples, the more time. In our experience we use this limit of one thousand tuples as a red mark. Any other questions? Yep.