Postgres auto_explain thresholds

Back and blogging about postgres again! We do other things too, it’s just that postgres is a really cool product that we’ve had time to learn a lot about recently and it’s interesting to write about.

When we configured postgres, we set up logging for slow queries, but we set a really high log_min_duration_statement of 60 seconds. We’ve been running with that configuration ever since, but we’ve decided that 60 seconds is way too high to be useful, since nothing should ever trigger it. We’ve wanted to lower it so we can keep better track of slow queries. The first step you’re probably going to take when optimizing a query is to find the query plan that was executed.

Postgres comes with a neat module called auto_explain which can be configured to log the query plans for queries that run for more than auto_explain.log_min_duration milliseconds. We wanted to lower log_min_duration_statement and, at the same time, set auto_explain to the same threshold, but we weren’t sure exactly how low is too low. The reason this matters is mostly to avoid spamming logs and creating too much I/O on the database server. The overhead of auto_explain stays low as long as you do not enable auto_explain.log_analyze. The documentation warns you about this, saying the following:

When this parameter is on, per-plan-node timing occurs for all statements executed, whether or not they run long enough to actually get logged. This can have an extremely negative impact on performance. […]

We tried to investigate exactly how large this “extremely negative impact” actually is, but didn’t find much. People we chatted with said a slowdown of about a factor of 2 could happen. We honestly don’t know what that would do to the database server and prefer not to find out, so we chose not to enable auto_explain.log_analyze at all.

There’s another postgres extension that was really helpful when setting our thresholds. pg_stat_statements will gather statistics about your most common queries for you, if you CREATE EXTENSION pg_stat_statements in your database. The view it provides contains several really interesting statistics about your queries, for example how often they are called, how long they take to execute (mean, max, min, standard deviation), how often they have cache misses and how many temporary files they generate (ideally none!).
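
As a rough sketch of how to get started with it (the extension also needs to be listed in shared_preload_libraries before it collects anything, and the columns and limit below are just what we happen to find interesting):

-- requires pg_stat_statements in shared_preload_libraries, then a restart
create extension if not exists pg_stat_statements;

-- the ten queries with the highest average running time (times are in milliseconds)
select query, calls, mean_time, max_time, stddev_time, temp_blks_written
from pg_stat_statements
order by mean_time desc
limit 10;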

We decided we could use this to obtain a ballpark estimate of how many queries different thresholds for auto_explain would catch. We really don’t want to be logging many thousands each day, as that would make it difficult to focus on the big wins, so we’d prefer to start with a lenient threshold and then lower it as we improve queries. Trying to query against pg_stat_statements also gave us an opportunity to practice some pretty cool SQL.

We based our choice on mean_time and calls (the total number of times each query has been executed) and investigated what percentage of our total queries had which average running time. We started by classifying each query in pg_stat_statements into different buckets depending on its mean_time:

select calls, case
              when mean_time > 60000 then 60
              when mean_time > 30000 then 30
              when mean_time > 10000 then 10
              when mean_time > 5000 then 5
              else 0 end as bucket_time
from pg_stat_statements;

This will classify each query according to its mean_time, giving it a bucket_time of 0, 5, 10, 30 or 60 seconds, which is the lower bound of the bucket we put it in. We can build on this query to aggregate some useful information:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements)
select sum(calls) as calls, bucket_time from query_times group by bucket_time;

The with statement here is called a Common Table Expression and you can read more about it here. It’s super useful for structuring queries nicely to make them readable, and it helps with building up queries one part at a time. The query above produces an output much like:

   calls    | bucket_time
------------+-------------
        688 |          30
       1131 |          60
       4166 |           5
 6065699764 |           0
       1483 |          10

That’s already useful and gives us our ballpark estimate. We know that it’s probably safe to set our threshold to 5 seconds for a start, because that won’t catch many queries (688 + 1131 + 4166 + 1483). But wouldn’t it be cool to calculate some percentages, so we can see what percentage of queries falls into each bucket? We can do that with window functions:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements),
  grouping as
    (select sum(calls) as calls, bucket_time
    from query_times group by bucket_time)

select bucket_time,
    100 * calls / (sum(calls) over ()) as percent,
    calls,
    rank() over (order by bucket_time)
from grouping;

The sum(calls) over () part of the query sums up calls across all rows in the window, which in our case is the entire result set, and attaches that total to every row. rank() numbers the rows according to the ordering of bucket_time. In total, this produces an output much like:

 bucket_time |          percent           |   calls    | rank
-------------+----------------------------+------------+------
           0 |        99.9998768942781880 | 6066323073 |    1
           5 | 0.000068674134583396087978 |       4166 |    2
          10 | 0.000024446409406427364011 |       1483 |    3
          30 | 0.000011341287708443713041 |        688 |    4
          60 | 0.000018643890113735231758 |       1131 |    5

Wouldn’t it be cooler still if we could accumulate the calls and percentages according to rank, such that each bucket also contains all the calls from the ranks below it? That way, we can say that 99.9999999% of all queries complete in less than X seconds. We can do that by joining the previous query with itself, using a join condition on the rank:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements),
  grouping as
    (select sum(calls) as calls, bucket_time
    from query_times group by bucket_time),

  stats as
    (select bucket_time,
      100 * calls / (sum(calls) over ()) as percent,
      calls,
      rank() over (order by bucket_time)
     from grouping)

select
    s1.bucket_time,
    s1.calls,
    s1.percent,
    sum(s2.calls) as accumulated_calls,
    sum(s2.percent) as accumulated_percent
from stats s1
    inner join stats s2 on s1.rank >= s2.rank
group by s1.bucket_time, s1.calls, s1.percent order by s1.bucket_time;

The result will look something like this:

 bucket_time |   calls    |          percent           | accumulated_calls |     accumulated_percent
-------------+------------+----------------------------+-------------------+------------------------------
           0 | 6066780909 |        99.9998769035684793 |        6066780909 |          99.9998769035684793
           5 |       4166 | 0.000068668952024004314466 |        6066785075 |  99.999945572520503304314466
          10 |       1483 | 0.000024444564534709169072 |        6066786558 |  99.999970017085038013483538
          30 |        688 | 0.000011340431827295959758 |        6066787246 |  99.999981357516865309443296
          60 |       1131 | 0.000018642483134697282684 |        6066788377 | 100.000000000000000006725980

Notice how accumulated_percent is larger than 100%? We think this is a combination of floating-point rounding errors and the fact that we’re actually running queries right now, which adds new entries to pg_stat_statements while we’re executing. Anyway, the accumulated columns should show how many queries are at least this fast, on average. Choosing different bucket sizes is easy: only the first common table expression needs to be modified.

The result gives us some pretty good ballpark figures for setting slow query thresholds.
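
To make that concrete: with a 5 second starting threshold, the relevant configuration might look roughly like the sketch below. The exact values are our assumptions, not a recommendation, and auto_explain needs to be preloaded for its settings to take effect.

# postgresql.conf (illustrative values)
shared_preload_libraries = 'auto_explain, pg_stat_statements'
log_min_duration_statement = 5000      # log statements slower than 5 seconds
auto_explain.log_min_duration = 5000   # log plans for those same statements
auto_explain.log_analyze = off         # avoid the per-plan-node timing overhead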

You should vacuum your databases

Postgres has a great multiversion concurrency control (MVCC) implementation. MVCC is a very useful concept, which allows you to do things like roll back transactions that make DDL changes. The goal is to allow consistent point-in-time views of the database, while still allowing multiple transactions to read and write the database at the same time – without having to use expensive locks.

This means that when you update data in your Postgres database, the existing data is not modified. Instead, a new version of that data is added to the database. This is because it’s not safe to directly modify existing data while other transactions could be reading it. At some point, the old version of the data is no longer interesting to anyone, which means it’s time to clean it out. That can be done by issuing VACUUM in your database. Normally, you don’t have to worry about doing that, because postgres has something called the autovacuum launcher. It will periodically launch an autovacuum process, which will vacuum tables that it believes have a decent amount of garbage (“dead tuples”). You can read more about autovacuum here.

So far we’ve used default settings for autovacuum, but we recently started investigating some slow queries and found out that the tables had not been autovacuumed or analyzed for a really long time (7 weeks). Manually vacuuming the affected tables cleared up lots of unused items and we’re paying close attention to check if that improves query performance. Below is some output from VACUUM VERBOSE ANALYZE:

There were 1829170 unused item pointers.
DETAIL:  1845 index row versions were removed.
49 index pages have been deleted, 92195 are currently reusable.

Investigating some more tables in the database revealed that most tables were being properly vacuumed and analyzed frequently, and that the tables we were having issues with just have very different workloads. They contain many millions of rows, and data is almost always only added to them, very rarely updated or deleted. It doesn’t seem like doing only INSERTs on a table will actually cause autovacuuming to occur. In terms of accumulating garbage, that’s fine, because INSERT creates very little garbage. But autovacuum will also normally take care of running ANALYZE, which updates the statistics the query planner uses to make efficient query execution plans. Often, the direct consequence of having outdated statistics is that the query planner selects suboptimal query execution plans. In bad cases, this can lead to severe performance degradation. In our case, we did see the query planner select better execution plans for a few hours after running ANALYZE, but it’s now back to picking the suboptimal plan.
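
A quick way to spot tables in this situation is to look at pg_stat_user_tables; a query along these lines shows when each table was last vacuumed and analyzed, and how many dead tuples it is carrying:

-- when were tables last (auto)vacuumed and analyzed, and how much garbage do they hold?
select relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze, last_analyze
from pg_stat_user_tables
order by n_dead_tup desc;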

That’s because we haven’t had much success with tuning the autovacuum parameters for these tables to make autovacuum analyze them. We found this blogpost very helpful in understanding how autovacuum parameters should be set. It looks like we’ll need to set up a job that analyzes the tables in some other way. The reason we think this is an interesting topic is that we encounter this sort of problem as application developers, not as DBAs (we’re not DBAs). We only encountered this problem because the workload for 2 particular tables in our cluster of 11 databases is very different from the rest, so autovacuum doesn’t really do a great job there. In very rare cases, we will batch-update a significant number of rows (tens of thousands), but not significant enough to trigger autovacuum (because the tables have many millions of rows).
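
The kind of per-table tuning we experimented with looks roughly like this; the table name and the thresholds are made up for illustration:

-- make autoanalyze trigger after a fixed number of changed rows
-- instead of a percentage of a very large table
alter table big_insert_mostly_table
    set (autovacuum_analyze_scale_factor = 0.0,
         autovacuum_analyze_threshold = 10000);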

This is just one more example of the broad scope of knowledge a team needs to have in order to develop efficient, modern web-applications. We see cases where we need to know about linux wizardry to hunt down socket leaks caused by bad ssh config, flexbox incompatibilities between different versions of the same browser, css minification differences between different versions of the same minifier, double-submits in javascript causing database deadlocks and many other different odd pieces of knowledge and trivia. Now we’re adding table bloat, index bloat, stale statistics and vacuuming to our list.

Tuning postgres connection pools

We use client-side connection pools in our applications, in order to ensure that incoming requests can quickly get a database connection. This way, users don’t have to wait for the application to do a potentially expensive handshake with the database during their request. Most of our applications are Java applications, and we use a mix of HikariCP and c3p0 pools. We’re probably moving in the direction of using HikariCP everywhere, as it seems easier for us to tune.

When we were initially setting up our postgres cluster, we did some research on how to tune it, and one of the parameters we found hard to set was max_connections, which determines how many concurrent connections clients can make to the database server. Most of the documentation and blog posts we found stressed the importance of not setting it too high, but didn’t specify exactly what that meant. The postgres wiki states that it should be at most a few hundred on good hardware, so we set ours to 200 for the initial setup.

We have 11 applications that were going to end up using postgres, and we hadn’t done any tuning of their individual connection pools. We decided we would just deal with the connection pool sizes later on, and proceeded to set up our database with a conservative max_connections.
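
A quick sanity check we like is comparing the number of connections currently in use against that limit; something like this (it ignores the finer points such as reserved superuser connections):

-- roughly how close are we to max_connections right now?
select count(*) as connections_in_use,
       current_setting('max_connections')::int as max_connections
from pg_stat_activity;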

There are good resources that document why you shouldn’t use too many connections against your database. A higher number of connections might mean that each request gets hold of a database connection faster, but it does not necessarily mean that its results are ready sooner. A contrived example of how this works:

Imagine you’re in line at a fast-food joint and you order a burger. There’s only a single employee in this fast-food joint and they need 2 minutes to prepare a single burger. There are 10 people in the line and they all want burgers. The math here is easy: it’ll take 20 minutes to serve everyone. Making this single employee work on many meals in parallel won’t change the total time they need to finish cooking the meals. If the employee spends 1 minute on each meal, all the meals are 50% done and 10 minutes have passed. Nobody has gotten to eat yet. If that employee then finishes up all the burgers at the same time, everyone gets their meal after 20 minutes and can start eating. But if that employee instead finishes one meal at a time, they can start serving people after only 2 minutes. So the best-case and average wait times improve a lot, and the time and resources consumed are still the same (20 minutes, 10 burgers).

In a database system, your resource constraints are going to be CPU, memory or disk, and those resources cannot be scaled to infinite degrees of parallelism. So you can’t just throw an infinite number of connections at the problem, because at some point the underlying system will start queueing requests to the hardware. The disk can’t respond to 10000 disk reads at the same time; there’s going to be some sort of queueing in postgres, the OS, and maybe the hardware. In many cases, having very few connections might improve latency for incoming queries. There’s an interesting post from HikariCP on this topic.

During fall 2016, when we were done migrating most of our applications to postgres, we started running into problems with our max_connections setting. At the time, we tried to reduce the connection pool sizes in the applications, but it proved to be really hard to figure out exactly how many connections each application would need. We still haven’t really figured out a great way of tuning this particular parameter and have instead increased our max_connections. But we’ve done some work on improving this situation and started tracking, on the database side, how many connections are in use in each database.

We do this by running queries against the pg_stat_activity view. Our query to log this data looks something like this:

with states as
    (select datname, client_addr, case
        when now() - state_change < interval '10 seconds' then '10sec'
        when now() - state_change < interval '30 seconds' then '30sec'
        when now() - state_change < interval '60 seconds' then '60sec'
        else 'idle' end
    as stat from pg_stat_activity)
select datname, client_addr, stat, count(*)
from states group by datname, client_addr, stat;

The idea being that if an application has lots of connections that have been idle for more than 60 seconds, it’s probably safe to reduce its connection pool size. We collect this data using Telegraf and visualize it in a Grafana dashboard; an example graph is shown below.

We haven’t done much with this data yet, but we are seeing significant numbers of idle connections, so we know that we can improve matters.

A really easy way to end up with large connection pools is to give in to the knee-jerk reaction of increasing the pool size whenever an application cannot obtain a connection from the connection pool. That’s the wrong reaction in many cases. Think of it like this: if your application has 10 connections and all your transactions run in 5 milliseconds, each connection can execute 200 transactions per second, giving you a total capacity of 2000 transactions per second. If you run multiple instances of your application, you can multiply that number further. We run each application in at least 2 instances, so in this example, we’d be able to execute 4000 transactions per second. That’s 345,600,000 transactions per 24 hours. None of our databases require a throughput that high.

If your transactions take 50 milliseconds, that number goes down by a factor of 10. So the first thing we should check when we’re running out of connections is whether we have an opportunity for a factor-of-10 performance optimization of our transactions. Many queries can be made orders of magnitude faster by adding missing indexes, restructuring the queries or other trivial fixes. Some queries cannot easily be made very fast, and in those cases it might make sense to increase the connection pool size. Some queries can also be slow for non-obvious reasons. We have examples of queries that are super fast most of the time, but slow early in the morning. This is probably because the data is not in any file-system cache, since it hasn’t been accessed all night.

What tends to happen when users encounter slow pages is that they will try to refresh, issuing another query and occupying another slot in the connection pool. When a query takes over a minute to execute, 10 users are easily capable of exhausting a modest connection pool between them. The best fix in this case, is to make queries faster. This has the added benefit of making users happier too.

We’re trying to get into the mindset where we think of increasing connection pool sizes as a temporary workaround or a last resort and instead get better at writing efficient queries.

What does your backup do, Sam?

We have been working on our postgres backup routines lately and spent a lot of time convincing ourselves that we have a working setup. This actually took a considerable amount of time, because we had a lot of skepticism after our recent firedrills. We think that skepticism is probably the healthy approach to backup routines anyway. Most people know that backups are important, but putting a lot of work into them has given us reasons to think about which problems backups actually solve.

We already mentioned in a previous blogpost that we have a replication server set up, and hopefully the chance that both the master and the standby go down is pretty low. So in a normal outage, we hope to be able to restore service by promoting the standby to master instead of doing a full recovery from backup, which would take longer. But there are scenarios where our standby won’t save us. The recent heartbreaking gitlab database incident is a good example: replication had stopped in that case, and the activity that went wrong was actually the attempt to get it up and running again.

Another scenario would be when your standby and master are both in the same data center. In this case, maybe a network outage would make both inaccessible and give you a complete loss of service. At this point, if you had backups, you could create new database servers. Obviously, that requires your backup to also be available off-site.

But there are more subtle types of problems where a standby cannot help you. One example would be a developer introducing a bug in an application that causes it to overwrite valid data with nonsense. In this case, the standby will just happily replay the changes the master makes, and only a backup or a log file could help you recover the lost data. Another example might be a script or person executing a database query that drops a table on the wrong server or in the wrong environment.

We’ve been thinking a little about how we could handle such outages and have some ideas of our ability to handle them, but we haven’t tested ourselves yet. An exercise that we want to do, is to delete a table from a database in QA and attempt to recover that data without losing any transactions after the table is dropped.

For example:

  1. Everything’s working fine
  2. Developer makes mistake, dropping a table
  3. Developer goes to lunch
  4. Users keep creating traffic and transactions
  5. Developer comes back from lunch, notices problem
  6. ???
  7. Developer goes home after work

We think we could manage to sort out the above incident. Our current idea is that we would use our backup and our WAL archive to do a point-in-time recovery to time 1. We wouldn’t do that on our current master database, because that would cause us to lose the transactions between time 1 and time 5. So we’d set up an entirely new database server from the backup instead. From this new instance, we can do a full dump of any relevant tables using pg_dump. Hopefully, we can then import the generated SQL into our master database.

This is a problem you can’t really solve with a standby.

In our setup, a backup job runs pg_basebackup in the middle of the night. Our master server has an archive_command that it uses to store WAL segments. Both the basebackups and the WAL archive are stored on-site and off-site, so we’ll have access even if the network in our data centers is down for some reason. We have also configured wal_keep_segments, because our backup tests revealed that a database set up from backup was not able to start streaming replication without it. We don’t fully understand this, as all the required WAL segments are present in the archive.
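
The master-side part of this setup is only a few lines of postgresql.conf; the archive path and the wal_keep_segments value below are made up for illustration:

# postgresql.conf on the master (illustrative values)
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
wal_keep_segments = 1000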

The first thing a developer would do at time 5 is set up a new postgres server. They can do that by fetching the latest basebackup from either on-site or off-site storage and extracting it on a server. The next step is to create a recovery.conf, setting the correct restore_command to extract WAL segments from the archive, and setting recovery_target_time to just before time 2.
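
Such a recovery.conf might look roughly like this; the archive path and the timestamp are made up:

# recovery.conf (illustrative values)
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2017-02-20 11:55:00'   # just before the table was dropped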

Starting this postgres instance should then produce a database server with the same state as what was in the master at time 2. Depending on the amount of WAL segments in the archive, this could take a while.

When it’s done, the developer can use pg_dump, providing the correct --table argument and database name, which should produce a .sql file containing INSERT statements for the missing data.
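
A sketch of that command, with made-up database and table names; note that --column-inserts is what makes pg_dump emit INSERT statements rather than COPY:

pg_dump --data-only --column-inserts --table=the_dropped_table our_database > recovered_rows.sql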

They can then replay the remaining WAL by setting a later recovery_target_time and restarting the fresh database instance, which should provide them with a database server where the table has been dropped. They can use the fresh database instance to test that importing their dump fixes the problem, or at least doesn’t make it worse. After they’ve tested it and verified that it works, they can do the import on the master server and go home after work. Or that’s the idea.

A recovery is probably harder if the table has been truncated or updated wrongly, which might mean someone has to set up some sort of data reconciliation. But it might still be possible.

Being confident in our backups

One of the things we’ve done in order to be confident in our ability to restore from backup is to set up an automated job that creates a database instance from backup every day. This database then connects as a standby to our current master and verifies that it can start streaming replication. We throw this database away once the backup has been verified as good. It’s a really nice feeling when the backup-test script posts on Slack that it has successfully done a point-in-time recovery of a backup.

Making this backup-test script was a pretty simple job. It also serves as living documentation of the recovery procedure. And since it’s run every day, we can be confident that the recovery procedure actually works, unlike some 3-year-old recovery procedure documentation on a wiki. A nice side effect is that we now actually know how long it would take us to recover from backup.

We haven’t yet completed an automated test for our off-site backup. In principle, it should work exactly the same way, but the wal archive is a different source so the recovery procedure is actually slightly different. We need to do an exercise to determine if we need to write some tooling around this recovery procedure to feel confident about it.

We do want to run firedrills on recovery situations. The table-drop scenario recovery is an example of a recovery job that is more complicated than the test that we run every day, so it would be good to do it manually a few times to verify that we can recover the data in such situations. And it could help us find the limits of what we are actually able to do. Maybe we can do more than we think.

The value of firedrills

In January 2016, we set up a postgres cluster at bring. While doing the initial configuration, we designed what we thought was a pretty decent backup architecture and we set up streaming replication to a hot standby. While configuring the cluster, we made sure to verify our ability to recover from backup, as well as our ability to fail over from the master server to the standby server. This gives us redundancy in case of outages or patching, and it gives us recoverability in case of data corruption, or so we thought. We documented the procedures and all was well. At this point our postgres installation was on 9.4, and we upgraded that to 9.5 to get some new features before we really started using it.

We ran a firedrill in November where we performed a failover in our QA environment. That failover exercise was not a success, because streaming replication had stopped two weeks prior and we hadn’t picked that up. Cue introducing better monitoring. We set the standby up from scratch again. In January 2017, we scheduled another few firedrills, and in the first one we encountered some difficulties that we learned a lot from. We spent the better part of two work days figuring out these problems; what follows is a short summary of what happened and what we learned.

Our postgres master is set up to archive its write-ahead log (WAL) to a location that is shared with the standby, and the standby receives streaming replication from the master, which is a fairly typical setup.

To perform the failover, we first stopped the postgres master server, ran service postgresql promote on the standby server and updated a DNS record. So far, so good – there was 45 seconds or so of downtime as seen from the applications, and everything came back up fine. Then we started working on setting up the former master as a standby server, so we’d be back to where we started, and this is where we ran into trouble.

When we wrote the failover documentation, it seemed that in order to use a former master as a standby, you had to run pg_rewind.

Our first sign that something was wrong was the unexpected output no rewind required, which we later learned is actually a good thing. Our instructions said that we’d find a recovery.done file in the postgres data directory, which we also didn’t find. Our failover documentation had expected pg_rewind to copy this file for us. At this point, we became aware of the botched failover attempt from November, and went off trying to investigate whether it had created some weird state in the cluster. We discovered that we had 5 timelines on our master server and had no idea what that meant. We created a recovery.conf manually, and attempted to start the former master as a standby. That failed, with errors in the log that the server was unable to retrieve timeline 4. After a while, we discovered this file (00000004.history) in our WAL archive and moved it to the new master. This time, the standby did indeed read timeline 4, and also timeline 5, and started trying to read WAL.

This failed repeatedly because it was attempting to retrieve a WAL segment that the master had already archived - 000000050000009A00000059. We attempted to copy this segment from the WAL archive to the master, but that didn’t help. At this point, we started wondering why the standby wasn’t simply executing its restore_command – after all, the segment was present in the archive.

We noticed that the standby logged that it started streaming from timeline 3, while the master logged that it was writing to timeline 5. We added some logging to our restore_command and discovered that the standby was actually trying to retrieve 000000030000009A00000059 (notice the 3), and not 000000050000009A00000059. These have the same WAL segment number, but are on different timelines. At this point we discovered a file named 000000030000009A00000059.partial in our WAL archive and started worrying about data corruption. As a last-ditch attempt for the day, we removed the .partial suffix from the file; the standby recovered it, but still couldn’t start streaming replication.

After hours, we read a bit about postgres and timelines, and in the morning we tried to set up the standby by adding recovery_target_timeline='latest' to our recovery.conf. At this point, the standby was able to retrieve 000000050000009A00000059, but no other WAL files from the day before were still available: our backup job had made a new pg_basebackup and cleaned out the old WAL.
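
For reference, a recovery.conf that lets a rebuilt standby follow a newly promoted master across timelines might look roughly like this (the hostname and paths are made up):

# recovery.conf on the standby (illustrative values)
standby_mode = 'on'
primary_conninfo = 'host=new-master.example.com user=replication'
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_timeline = 'latest'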

We decided to set up a completely fresh standby from our backup and got our second nasty surprise in 2 days, when it turned out that we were missing the very first WAL segment after the pg_basebackup because of a bug in our backup-script. At this point, we set up a fresh standby from the current running master with pg_basebackup and slept very poorly.

We now know for sure that our problem with setting up the former master as a standby was due to not knowing about timelines. What we needed to know:

  • When a standby is promoted to master, it creates a new timeline
  • It will archive the last WAL segment of the old timeline with a .partial suffix
  • The .partial segment file is harmless and you’re not expected to need it
  • A postgres instance that starts up with a recovery.conf will, by default, attempt to recover along the timeline it was on when it was shut down
  • Using recovery_target_timeline lets you control this behaviour
  • Using pg_rewind is not necessary when the former master shuts down gracefully
  • In effect, what pg_rewind does is discard transactions on the old master that the new master does not know about

In our case, we were on timeline 3 when we started. Because of the botched exercise in November, we also had a timeline 4, but it wasn’t in use anywhere. So when we promoted the standby, we created timeline 5 which started at the same WAL segment number as timeline 3 ended on.

You can learn even more about timelines, WAL and postgres standby setups in this talk and this talk by Heikki Linnakangas, the author of pg_rewind (note: these are kind of scary).

After the firedrill, we made sure to keep WAL segments around for 7 days. We obviously documented recovery_target_timeline. We also set up testing of our backups, so that their viability for recovery is automatically verified every day. We ran a few firedrills after this one, and those have worked out fine; we’re now pretty confident that we have a much better understanding of how WAL works.

We’re super happy that we did this firedrill. Not only do we now know a whole lot more about how postgres works, but we also feel pretty confident that we have a working backup and can sleep much better at night. Firedrills in general are awesome, and this one in particular helped us close a lot of really bad problems in our database setup.