Measuring JVM stats

As part of creating a new application that has strict performance requirements and needs to deal with large files, one of the teams working on Mybring wanted a tool to investigate how much memory their JVM was using. It’s easy to see how much memory a process has allocated on a Linux system, but it’s more difficult to find out how much of the allocated memory it is actually using. The JVM likes to allocate up to its maximum heap size very aggressively, but it might take a long time before it actually uses all of that memory.

There are some tools you can use to connect to a running JVM and inspect this, but for applications with these requirements, historical data is going to be very interesting. Previously, we’ve used tools like Java Melody to get this kind of information. Java Melody is a good tool, but it has some disadvantages. To use it in every JVM application we have, we would need to add code to all of them somehow. Also, for this type of metric data we prefer to use Grafana, which we’ve previously blogged about here. We don’t know if it’s possible to get the data from Java Melody into Grafana.

We already had telegraf running on all our servers. It collects system information, such as free memory, load averages and CPU utilization, every 10 seconds, and it seemed natural to leverage it to do something similar for JVM metrics. As it turns out, recent versions of telegraf have a very cool plugin for a tool called Jolokia, an agent you can attach to the JVM to get an HTTP interface for performing JMX queries. You can use JMX queries to answer questions such as “How much memory is in the heap now?” or “How many threads are there?” – which is exactly what we needed.
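To give an idea of what that interface looks like (a hedged example – the port is Jolokia’s default agent port, and the exact URL depends on how the agent is configured), reading the heap usage MBean is a plain HTTP GET that returns JSON:

GET http://127.0.0.1:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage

The response contains a value object with the used, committed, max and init sizes of the heap.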

Since all of our applications are set up and configured with configuration management, it seemed that with this approach we could make a one-time effort to get this sort of monitoring for every JVM application we run. We used Puppet to add the Jolokia agent to our servers and to add a startup parameter to a couple of our applications to activate it. After that, we added configuration to telegraf to make it perform some interesting JMX queries every time it runs. We used a template that looks a lot like this:

# Read JMX metrics through Jolokia
[[inputs.jolokia]]
## This is the context root used to compose the jolokia url
## NOTE that Jolokia requires a trailing slash at the end of the context root
context = "/jolokia/"
## List of servers exposing jolokia read service
[[inputs.jolokia.servers]]
name = "<%= @name %>"
host = "127.0.0.1"
port = "<%= @port %>"

[[inputs.jolokia.metrics]]
name = "heap_memory_usage"
mbean = "java.lang:type=Memory"
attribute = "HeapMemoryUsage"

Puppet will put one configuration file for each application into /etc/telegraf/telegraf.d for us. Since telegraf runs on the same server as the applications, it can reach Jolokia on the loopback interface, which means we do not have to expose JMX on an external IP address.

After testing with a couple of applications and getting some visualization up and running, we decided that it was working well and we activated this puppet configuration for most of our applications. The nice thing for us is that we got this monitoring working for all of our applications for pretty much the same amount of work as doing it for a single one.

It’s easy to add more JMX queries to the configuration files. Other data we’ll collect includes connection pool data, thread counts, non-heap memory usage and the number of loaded classes – see the sketch below. One scenario where this will be very useful is when we have a memory leak and want to figure out which change caused it, or when we run out of Compressed Class Space and want to figure out which change caused us to start leaking classloaders.
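A rough sketch of what those extra queries could look like in the same template (the MBeans below are the standard java.lang ones; connection pool MBeans depend on which pool implementation the application uses, so they are left out here):

[[inputs.jolokia.metrics]]
name = "thread_count"
mbean = "java.lang:type=Threading"
attribute = "ThreadCount"

[[inputs.jolokia.metrics]]
name = "non_heap_memory_usage"
mbean = "java.lang:type=Memory"
attribute = "NonHeapMemoryUsage"

[[inputs.jolokia.metrics]]
name = "loaded_class_count"
mbean = "java.lang:type=ClassLoading"
attribute = "LoadedClassCount"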

Additionally, the latest version of Grafana lets us set up alerting based on thresholds on the data we collect, which seems like something we’ll end up using. All in all, we’re very happy with the telegraf, InfluxDB and Grafana stack, which makes it really easy to add new monitoring to our systems.

Checking out checkpoints

Back and writing about postgres again! It turns out it’s hard to go one week without discovering something new and interesting about postgres. Performance tuning has been uncharted territory for us until now; we have done minor tweaks here and there, but never dived into it head first. Recently we encountered some slow inserts, and as part of investigating those, we discovered that running postgres with the default settings for checkpoints can lead to spiky I/O. Specifically, an xlog-triggered checkpoint may force a timed checkpoint to complete very quickly, leading to bursty writes. You can read more about that here. With log_checkpoints = on in your settings, a normal, time-triggered checkpoint will look like this:

2017-03-14 18:26:36.196 CET db= user= remote_host= LOG: checkpoint starting: time
2017-03-14 18:30:59.757 CET db= user= remote_host= LOG: checkpoint complete:
wrote 20176 buffers (1.9%); 0 transaction log file(s) added, 0 removed, 10 recycled;
write=262.335 s, sync=1.202 s, total=263.560 s; sync files=416, longest=0.168 s,
average=0.002 s; distance=140722 kB, estimate=351122 kB

An xlog-triggered checkpoint, which you want to avoid, looks like this:

2017-03-15 16:07:14.646 CET db= user= remote_host= LOG: checkpoint starting: xlog
2017-03-15 16:11:44.345 CET db= user= remote_host= LOG: checkpoint complete:
wrote 53503 buffers (5.1%); 0 transaction log file(s) added, 0 removed, 15 recycled;
write=269.294 s, sync=0.385 s, total=269.698 s; sync files=427, longest=0.099 s,
average=0.000 s; distance=400794 kB, estimate=400794 kB

We have telegraf set up to read pg_stat_bgwriter every 10 seconds, which helped us immensely in locating the spiky checkpoint activity. The picture at the top of the blog post is a graph of the derivative of the values in this view. The green line is the number of buffers that had to be flushed by the client connections, the blue line indicates the work done by the bgwriter, and the yellow line represents the checkpointer process. Normally, you want the bgwriter to do more work than the clients.
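Those three lines correspond to columns in pg_stat_bgwriter. A quick way to eyeball them – a sketch, not the exact query our telegraf plugin runs – is:

select checkpoints_timed,     -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,       -- checkpoints triggered by xlog volume or explicit requests
       buffers_checkpoint,    -- buffers written by the checkpointer (yellow)
       buffers_clean,         -- buffers written by the bgwriter (blue)
       buffers_backend        -- buffers written by client backends (green)
from pg_stat_bgwriter;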

The timings of the slow inserts coincided with spiky checkpoints, and we also observed a high I/O rate at these points in time. It therefore seemed plausible that the inserts were slow because of spiky checkpoints. Something you can do to make checkpoints less spiky is to increase checkpoint_timeout and max_wal_size. Following the advice from the blog post linked above, we conservatively increased checkpoint_timeout from the default of 5 minutes to 15 minutes and max_wal_size from the default of 1GB to 5GB. Furthermore, we tried to get the bgwriter to be more aggressive, to avoid client connections flushing shared buffers. In the postgres documentation, and in the pg_stat_bgwriter view, a client connection is referred to as a backend.
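A minimal sketch of the checkpoint changes (shown here with ALTER SYSTEM; putting the same values directly into postgresql.conf works just as well):

ALTER SYSTEM SET checkpoint_timeout = '15min';  -- default is 5min
ALTER SYSTEM SET max_wal_size = '5GB';          -- default is 1GB
SELECT pg_reload_conf();                        -- both settings take effect on reload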

You can read more about the shared buffer cache here. A short summary is that everything postgres reads from disk has to be read into a shared buffer. If a shared buffer has been modified, it is said to be dirty. Before new data can be read into a dirty shared buffer, it has to be cleaned, which is accomplished by flushing it to disk. This can be done by a client connection if there are no clean shared buffers available, or it can be done by the checkpointer or the bgwriter. It is preferable that the bgwriter does this rather than the clients, because no user is ever waiting on the bgwriter.

With the default settings, the bgwriter will run up to 5 times per second and is allowed to flush up to 100 pages each time it runs; with one page being 8192 bytes, this means it can flush up to 4MB per second. In our database, this was dwarfed by the amount of flushing done by client connections. So we doubled the frequency of the runs and doubled the maximum number of pages to clean per run, which allows the bgwriter to flush up to 16MB per second.
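In terms of settings, that corresponds to halving bgwriter_delay and doubling bgwriter_lru_maxpages – again a sketch rather than a recommendation:

ALTER SYSTEM SET bgwriter_delay = '100ms';     -- default 200ms, so up to 10 runs per second
ALTER SYSTEM SET bgwriter_lru_maxpages = 200;  -- default 100 pages per run
SELECT pg_reload_conf();                       -- 10 runs/s * 200 pages * 8192 bytes ≈ 16MB/s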

This graph shows the bgwriter being a lot more active and the checkpointer being a lot more even, which is exactly what we hoped for. Note that the Y-axis on this graph stays below 40MB per 10 seconds. In the graph from before we tuned the bgwriter, the Y-axis goes to about 80MB because there’s a high spike in it.

Unfortunately, this did not fix our slow inserts, even though our disk activity is now much more even. We no longer think that spiky disk I/O was the reason for the slow inserts. In hindsight this should have been obvious, as we were only observing the slow inserts on certain tables. If spiky I/O were the reason, we would have seen it affect other tables as well. Our new lead is the fact that all the affected tables have gin indexes. We found some email threads discussing the performance implications of how gin fast update works. This is an interesting discussion of the trade-off that fast update makes: it makes inserts faster on average, but occasionally a few inserts will be slow if autovacuum has not cleaned the gin pending list. Our tables do not get vacuumed often enough because their workload is mostly inserts. You can read more about vacuuming here. For now, the easiest fix seems to be to disable gin fast update or to set up a frequent vacuuming scheme for the affected tables – both sketched below. We’ll be back with the results some other Friday!
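For reference, a hedged sketch of the two options (the index and table names are hypothetical):

-- Option 1: turn off fast update for the gin index, trading slightly slower
-- average inserts for more predictable insert latency
ALTER INDEX my_gin_index SET (fastupdate = off);

-- Option 2: clean the pending list on a schedule, e.g. a regular manual vacuum...
VACUUM my_insert_heavy_table;
-- ...or, on postgres 9.6 and later, flush just the gin pending list
SELECT gin_clean_pending_list('my_gin_index'::regclass);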

Metrics at Mybring

At Mybring, we collect a lot of metrics, and we think they are incredibly useful and can be a lot of fun. At the end of the day, it’s essential to understand our systems and users. Collecting metrics enables us to make informed decisions based on facts and numbers rather than speculation. We get an overview of the health of our servers, and changes in system behaviour are reported over time, which helps us discover problems before they happen.

Collecting all the actions done by your systems and developers is really powerful. Imagine you have a dashboard that shows you the page load times for all the pages that recently got slower, together with all your deploys. It would be really easy to track down which change introduced a performance problem. This capability requires you to record both your deploys and your performance data. We don’t know exactly which data about our systems will be crucial to know, so we err on the side of caution and record almost everything we can think of. This excellent presentation by Coda Hale goes into many other good examples of why you should be measuring things.

Other than tracking our applications and servers, there is a bunch of other data that is fun to have – like tracking pull requests and how often we deploy/release to production. It’s cool to see the trend in open pull requests and the size of pull requests per repository to track down workflow issues. We can also use these numbers to check whether process changes make us more productive and faster.

The picture at the start of this post is a mosaic of graphs from a variety of systems that we have at Mybring. Since today we are mostly writing about metrics, we will focus on InfluxDB, Telegraf and Grafana.

InfluxDB is a time-series database. It is great for storing massive amounts of numerical data associated with timestamps. For example, we store load averages, CPU utilisation, socket state, memory usage and many other things. We record this data for 40 servers, every 10 seconds. After one year, the InfluxDB data on disk is only 35GB, which means we will never have to delete it, which is nice. To read more about how this is possible you should check this out.

We have telegraf installed on every server, collecting metrics every 10 seconds. To instrument applications, we use a home-brewed metrics microlibrary that sends data asynchronously to InfluxDB. We do this so that requests to our applications won’t fail if we can’t write to InfluxDB for some reason. Additionally, the scripts that we use to deploy and release applications also send data to InfluxDB. This lets us find out how fast our deploys are, how often we do rollbacks and a number of other interesting things.

Grafana is a tool for visualising time-series data, such as server metrics and application metrics. Most of the graphs in the picture above are screenshots from grafana. It integrates very nicely with InfluxDB and also supports a number of other backends, such as elasticsearch and graphite.

We have created a lot of different dashboards in grafana; the one pictured above is a big-picture view of our servers. We use it to investigate performance at the OS level. We have stats about sockets on this dashboard because sockets are an important resource to track, since it is difficult to debug servers that have run out of sockets.

Dashboards are really easy to create with grafana, so we have a lot of them. Some of them were created while debugging production issues. Having all the data in the same view makes it much easier to correlate symptoms with underlying problems. Grafana also supports annotating dashboards with text, which makes the dashboards created during outages useful for real-time status updates, postmortems and investigating how we could have detected the problem before it occurred.

There are tools available today that make it really easy to collect and investigate a lot of metric data, so there is really no reason not to.

Postgres auto_explain thresholds

Back and blogging about postgres again! We do other things too; it’s just that postgres is a really cool product that we’ve had time to learn a lot about recently, and it’s interesting to write about.

When we configured postgres, we set up logging of slow queries, but we set a really high log_min_duration_statement of 60 seconds. We’ve been running with that configuration ever since, but we’ve decided that 60 seconds is way too high to be useful, since nothing should ever trigger it. We’ve been wanting to lower it so we can get better at keeping track of slow queries. The first thing you’ll probably do when optimizing a query is to find the query plan that was executed.

Postgres comes with a neat module called auto_explain, which can be configured to log the query plans of queries that run for more than auto_explain.log_min_duration milliseconds. We wanted both to lower log_min_duration_statement and to set auto_explain to the same threshold. We weren’t sure exactly how low is too low. The reason this matters is mostly to avoid spamming the logs and creating too much I/O on the database server. The overhead of auto_explain stays low as long as you do not enable auto_explain.log_analyze. The documentation warns you about this, saying the following:

When this parameter is on, per-plan-node timing occurs for all statements executed, whether or not they run long enough to actually get logged. This can have an extremely negative impact on performance. […]

We tried to investigate exactly how large this “extremely negative impact” actually is, but didn’t find much. People we chatted with said a slowdown by about a factor of 2 could happen. We honestly don’t know what that would do to the database server and prefer not to find out, so we chose not to enable auto_explain.log_analyze at all.

There’s another postgres extension that was really helpful when setting our thresholds. pg_stat_statements will gather statistics about your most common queries for you if you CREATE EXTENSION PG_STAT_STATEMENTS in your database. This view contains several really interesting statistics about your queries, for example how often they are called, how long they take to execute (mean, max, min, standard deviation), how often they have cache misses and how many temporary files they generate (ideally none!).
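As a quick illustration (a sketch – the columns are from the pg_stat_statements view, and the limit is arbitrary), the most expensive queries by average execution time can be listed like this:

select query,
       calls,
       mean_time,          -- average execution time in milliseconds
       max_time,
       stddev_time,
       shared_blks_read,   -- blocks read from disk, i.e. cache misses
       temp_blks_written   -- temporary file activity, ideally zero
from pg_stat_statements
order by mean_time desc
limit 10;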

We decided we could use this to obtain a ballpark estimate of how many queries different thresholds for auto_explain would catch. We really don’t want to be logging many thousands of queries each day, as that would make it difficult to focus on the big wins, so we’d prefer to start with a lenient threshold and then lower it as we improve queries. Querying pg_stat_statements also gave us an opportunity to practice some pretty cool SQL.

We based our choice on mean_time and calls (the number of times each query has been executed) and investigated what percentage of our total queries had which average running time. We started by classifying each query in pg_stat_statements into buckets depending on its mean_time:

select calls, case
              when mean_time > 60000 then 60
              when mean_time > 30000 then 30
              when mean_time > 10000 then 10
              when mean_time > 5000 then 5
              else 0 end as bucket_time
from pg_stat_statements;

This will classify each query according to its mean_time, giving it a bucket_time of 0, 5, 10, 30 or 60 seconds, which is the lower bound of the bucket we put it in. We can build on this query to aggregate some useful information:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements)
select sum(calls) as calls, bucket_time from query_times group by bucket_time;

The with statement here is called a Common Table Expression, and you can read more about it here. It’s super useful for structuring queries nicely to make them readable, and it helps with building up queries one part at a time. The query above produces output much like:

   calls    | bucket_time
------------+-------------
        688 |          30
       1131 |          60
       4166 |           5
 6065699764 |           0
       1483 |          10

That’s already useful and gives us our ballpark estimate. We know that it’s probably safe to set our threshold to 5 seconds for a start, because that won’t catch many queries (688 + 1131 + 4166 + 1483). But wouldn’t it be cool to calculate some percentages, so we can see what percentage of the queries falls into each bucket? We can do that with window functions:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements),
  grouping as
    (select sum(calls) as calls, bucket_time
    from query_times group by bucket_time)

select bucket_time,
    100 * calls / (sum(calls) over ()) as percent,
    calls,
    rank() over (order by bucket_time)
from grouping;

The sum(calls) over () part of the query will sum up calls for every row in the window, which in our case is the entire result set. rank() will number the rows according to the ordering of bucket_time. In total, this produces output much like:

 bucket_time |          percent           |   calls    | rank
-------------+----------------------------+------------+------
           0 |        99.9998768942781880 | 6066323073 |    1
           5 | 0.000068674134583396087978 |       4166 |    2
          10 | 0.000024446409406427364011 |       1483 |    3
          30 | 0.000011341287708443713041 |        688 |    4
          60 | 0.000018643890113735231758 |       1131 |    5

Wouldn’t it be cooler still if we could accumulate the calls and percentages according to rank, such that each bucket also contains all the calls from the ranks lower than it? That way, we can say that 99.9999999% of all queries complete in less than X seconds. We can do that by joining the previous query with itself, using a join condition on the rank:

with query_times as
    (select calls,
        case when mean_time > 60000 then 60
             when mean_time > 30000 then 30
             when mean_time > 10000 then 10
             when mean_time > 5000 then 5
             else 0 end as bucket_time from pg_stat_statements),
  grouping as
    (select sum(calls) as calls, bucket_time
    from query_times group by bucket_time),

  stats as
    (select bucket_time,
      100 * calls / (sum(calls) over ()) as percent,
      calls,
      rank() over (order by bucket_time)
     from grouping)

select
    s1.bucket_time,
    s1.calls,
    s1.percent,
    sum(s2.calls) as accumulated_calls,
    sum(s2.percent) as accumulated_percent
from stats s1
    inner join stats s2 on s1.rank >= s2.rank
group by s1.bucket_time, s1.calls, s1.percent order by s1.bucket_time;

The result will look something like this:

 bucket_time |   calls    |          percent           | accumulated_calls |     accumulated_percent
-------------+------------+----------------------------+-------------------+------------------------------
           0 | 6066780909 |        99.9998769035684793 |        6066780909 |          99.9998769035684793
           5 |       4166 | 0.000068668952024004314466 |        6066785075 |  99.999945572520503304314466
          10 |       1483 | 0.000024444564534709169072 |        6066786558 |  99.999970017085038013483538
          30 |        688 | 0.000011340431827295959758 |        6066787246 |  99.999981357516865309443296
          60 |       1131 | 0.000018642483134697282684 |        6066788377 | 100.000000000000000006725980

Notice how accumulated_percent is larger than 100%? We think this is a combination of floating-point rounding errors and the fact that we’re actually running queries right now, which adds new entries to pg_stat_statements while we’re executing. Anyway, the accumulated columns should show how many queries are at least this fast, on average. Choosing different bucket sizes is easy; only the first common table expression needs to be modified.

The result gives us some pretty good ballpark figures for setting slow query thresholds.
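As a minimal sketch of where we’d start (the 5-second threshold is the ballpark figure from above, and both modules have to be loaded via shared_preload_libraries, which requires a restart):

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements, auto_explain'

log_min_duration_statement = 5s       # log the statement text for slow queries
auto_explain.log_min_duration = 5s    # log the query plan as well
auto_explain.log_analyze = off        # keep the auto_explain overhead low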

You should vacuum your databases

Postgres has a great multiversion concurrency control (MVCC) implementation. MVCC is a very useful concept, which allows you to do things like roll back transactions that make DDL changes. The goal is to allow consistent point-in-time views of the database, while still allowing multiple transactions to read and write the database at the same time – without having to use expensive locks.

This means that when you update data in your Postgres database, the existing data is not modified. Instead, a new version of that data is added to the database. This is because it’s not safe to modify the existing data directly, since other transactions could be reading it. At some point, the old version of the data is no longer going to be interesting to anyone, which means it’s time to clean it out. That can be done by issuing VACUUM in your database. Normally, you don’t have to worry about doing that, because postgres has something called the autovacuum launcher. It will periodically launch an autovacuum process, which will vacuum tables that it believes have a decent amount of garbage (“dead tuples”). You can read more about autovacuum here.

So far we’ve used the default settings for autovacuum, but we recently started investigating some slow queries and found out that the affected tables had not been autovacuumed or analyzed for a really long time (7 weeks). Manually vacuuming those tables cleared up lots of unused items, and we’re paying close attention to whether that improves query performance. Below is some output from VACUUM VERBOSE ANALYZE:

There were 1829170 unused item pointers.
DETAIL:  1845 index row versions were removed.
49 index pages have been deleted, 92195 are currently reusable.
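The pg_stat_user_tables view shows when each table was last vacuumed and analyzed, and how many dead tuples it has; a query along these lines (a sketch, not necessarily exactly what we ran) is how you can spot neglected tables:

select relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
from pg_stat_user_tables
order by n_dead_tup desc;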

Investigating some more tables in the database revealed that most tables were being properly vacuumed and analyzed frequently, and that the tables we were having issues with just have very different workloads. They contain many millions of rows, and data is almost always only added to them, very rarely updated or deleted. It doesn’t seem like INSERTs alone will actually cause autovacuuming to occur. In terms of accumulating garbage, that’s fine, because INSERT creates very little garbage. But autovacuum will normally also take care of running ANALYZE, which updates the statistics the query planner uses to make efficient query execution plans. Often, the direct consequence of having outdated statistics is that the query planner selects suboptimal query execution plans. In bad cases, this can lead to severe performance degradation. In our case, we did see the query planner select better execution plans for a few hours after running ANALYZE, but it’s now back to picking the suboptimal plan.

That’s because we didn’t have much success with tuning the autovacuum parameters for the table to make autovacuum analyze it. We found this blogpost very helpful in understanding how the autovacuum parameters should be set. It looks like we’ll need to set up a job to analyze the tables in some other way – see the sketch below. The reason we think this is an interesting topic is that we encounter this sort of problem as application developers, not as DBAs (we’re not DBAs). We only encountered this problem because the workload for 2 particular tables in our cluster of 11 databases is very different from the rest, so autovacuum doesn’t really do a great job here. In very rare cases, we will batch-update a significant number of rows (tens of thousands), but not enough to trigger autovacuum (because the tables have many millions of rows).
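For reference, a hedged sketch of both routes – the per-table autovacuum parameters we experimented with, and a plain scheduled ANALYZE, which is where we’re leaning. The table name and thresholds are hypothetical examples, not our production values:

-- Per-table storage parameters: analyze after a fixed number of row changes,
-- instead of a fraction of a table that has many millions of rows
ALTER TABLE my_big_table SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold = 10000
);

-- Or keep the statistics fresh from a scheduled job
ANALYZE my_big_table;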

This is just one more example of the broad scope of knowledge a team needs to have in order to develop efficient, modern web-applications. We see cases where we need to know about linux wizardry to hunt down socket leaks caused by bad ssh config, flexbox incompatibilities between different versions of the same browser, css minification differences between different versions of the same minifier, double-submits in javascript causing database deadlocks and many other different odd pieces of knowledge and trivia. Now we’re adding table bloat, index bloat, stale statistics and vacuuming to our list.