At GoCardless, Postgres is our database of choice. Not only does it power our API, it's also used extensively across a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.
One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.
We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level-up our understanding of the Postgres query planner. We'll detail the investigation step by step: the symptoms we saw, the shape of the data involved, the plan Postgres chose and why it chose it, the plan we actually wanted, and where the planner's statistics went wrong. We conclude with the actions we took to prevent this happening again.
What we saw
One day we started seeing unusual query performance from several sources: both async workers and API requests seemed to take longer to respond than was normal. Examining the problematic database queries, the common factor was that each query touched our payment transitions table (almost half a billion rows) and associated relations.
Looking further, we saw a concerning number of long-running queries that would normally execute in under 50ms. Now suspicious that we'd hit a poor query plan, it was time to find an exemplar query and dig in:
SELECT *
FROM payment_transitions
JOIN payments
ON payments.id = payment_transitions.payment_id
WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
LIMIT 1;
How many payments, how many transitions?
Debugging a query plan almost always follows the same pattern: take time to understand the query, identify why the plan you received is bad, then hypothesise an ideal plan that would be fast. That new plan often requires an index that is yet to be created. On the other hand, perhaps a fast plan doesn't exist for this query. Whatever the outcome, it's key to every step that you understand the shape of the data you're querying.
Our query references two tables, payments and payment_transitions. In this system every payment has states it can transition through, and each of those states is represented as a row in the payment_transitions table; payments are modelled as state machines in Postgres, where each transition is a new row. We'll be filtering on a foreign key of payment_transitions called payout_id, which marks that transition as having been included in a payout.
Approximately 20% of our payment transitions will be marked with a payout_id, and there are approximately 20 payment transitions per payout. We can reasonably expect the number of payout_id values to grow linearly with the size of our payment_transitions table. Using approximate figures, if we have 350m payment transitions, we can expect 70m to be marked with a payout_id, and there would be almost 3.5m distinct payout_id values in the payment_transitions table. Finally, we have an index on the payout_id column that looks like this:
CREATE INDEX index_payment_transitions_on_payout_id
ON payment_transitions
USING btree (payout_id);
This should provide enough context for us to properly evaluate each potential plan for this query.
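If you want to sanity-check figures like these against the live data, a rough aggregate query is enough. This is only a sketch, assuming you can afford to run it on a replica or during quiet hours (counting distinct values across hundreds of millions of rows is slow); the table and column names are the ones used throughout this post:

SELECT
  count(*)                                  AS transitions,             -- ~350m
  count(payout_id)                          AS with_payout,             -- ~70m, i.e. ~20%
  count(DISTINCT payout_id)                 AS distinct_payouts,        -- ~3.5m
  count(payout_id)::float
    / NULLIF(count(DISTINCT payout_id), 0)  AS transitions_per_payout   -- ~20
FROM payment_transitions;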
EXPLAIN the plan
Using the query we'd pinpointed as problematic, we ran an EXPLAIN in a Postgres prompt to display the selected query plan.
EXPLAIN
SELECT *
FROM payment_transitions
JOIN payments
ON payments.id = payment_transitions.payment_id
WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
LIMIT 1;
QUERY PLAN
Limit (cost=1.14..21700.47 rows=1 width=262)
-> Nested Loop (cost=1.14..58045700.29 rows=2675 width=262)
-> Index Scan using payment_transitions_pkey on payment_transitions (cost=0.57..58022604.77 rows=2688 width=262)
Filter: (payout_id = 'PO00123456789Z'::text)
-> Index Scan using payments_pkey on payments (cost=0.57..8.58 rows=1 width=14)
Index Cond: ((id)::text = (payment_transitions.payment_id)::text)
(6 rows)
This query includes a join operation between our payments and payment_transitions tables. Formally, a relational join is an operation on two sets, R and S, which produces a result consisting of all combinations of tuples (a tuple being a single row result) in R and S that are equal under a particular matching condition.

When joining two tables, Postgres employs one of three strategies: merge, hash or nested loop. The strategy Postgres has chosen for our query is nested loop, the most naive of join strategies. Nested loop joins will iterate over every tuple in payment_transitions and, for each tuple, scan the payments table for tuples that match the join condition, which in this case is payments.id = payment_transitions.payment_id. Our result will be all the tuples that satisfied our condition.
Looking at our plan, we're using the payment_transitions_pkey index to select each transition tuple, and for each transition that has a matching payout_id we'll use an index lookup into the payments table to perform the join. The advantage of this query plan is that the first matching row we find using the payment_transitions_pkey index is guaranteed to match our query ordering constraint (ORDER BY payment_transitions.id), so we can halt execution at this point as we only require a single tuple (LIMIT 1). Scanning the payment_transitions_pkey index yields payment_transitions tuples in order of payment_transitions.id, which is why the first result from scanning our transitions with this index is guaranteed to have the minimum id value.
Sadly, this query plan is not going to work well for us. Recalling the underlying distribution of this data, for every payout we expect there to be approximately 20 matching transitions. If we assume that these matches are evenly distributed throughout the payment_transitions_pkey index (a fair assumption for query planning purposes) then we expect to scan 1/20th of the table before we find our first match.

Worse still, any payout whose matching transitions are skewed with respect to the payment_transitions.id ordering is going to be terrible for us. Our real world example happens to be queries for recently created payout_id values, and given that payment_transitions.id is a monotonically increasing sequence, we can expect our matching transitions to be right at the end of our scan. This is an example of how reasonable assumptions in theory can lead to pathological data access patterns in practice.
At 350m rows, this amounts to 17.5m rows we need to scan, or about 20GB of data. This plan will never match the performance we expect from this query, so something has gone deeply wrong.
Not only does a scan like this make the query slow, it also has a large performance impact on the rest of your database. Scans like these are likely to read old data that is not currently in our page cache, causing eviction of pages that are needed for other on-going queries. It's worth bearing this in mind when your database has strict performance requirements and depends on hot data being cached to meet them.
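If you can afford to run the offending query one more time (ideally against a replica), EXPLAIN's ANALYZE and BUFFERS options turn these estimates into hard numbers, showing how many rows the scan actually walked and how much data it had to pull from disk rather than the cache. A sketch, reusing our exemplar query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM payment_transitions
JOIN payments
ON payments.id = payment_transitions.payment_id
WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
LIMIT 1;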
What did Postgres expect?
The calculations we just performed are very similar to how Postgres evaluates query plans. In fact, the statistics we've been quoting are tracked and updated regularly by Postgres through the auto-analyze process, and are known as the statistic values n_distinct and null_frac.

Having a measure for each column of the number of distinct values (n_distinct) and the fraction of rows for which the column is null (null_frac) enables Postgres to compute the expected number of rows returned for a given query as approximately row_count * (1 - null_frac) / n_distinct.
Postgres also tracks each column's most common values and histogram bounds, allowing its computations to take statistical outliers into account. In this example, we can safely ignore the histogram bounds because the most common values cover only a small percentage of the table.
Looking at the explained output of our plan:
-> Index Scan using payment_transitions_pkey on payment_transitions (cost=0.57..58022604.77 rows=2688 width=262)
Filter: (payout_id = 'PO00123456789Z'::text)
We see that Postgres expected 2688 payment transitions to match our filter condition on payout_id. Assuming this is a typical payout (it doesn't appear in Postgres' most common values), we've wildly over-estimated the number of transitions attached to the average payout, which should be about 20. When we look at our statistics for this column, we start to see some concerning numbers:
postgres=# select attname, n_distinct, null_frac
             from pg_stats
            where tablename = 'payment_transitions'
              and attname = 'payout_id';

  attname  | n_distinct | null_frac
-----------+------------+-----------
 payout_id |      25650 |      0.81
(1 row)
Plug these values into our formula from above and they imply that the number of payment_transitions we'll find matching our payout_id is 350m * (1 - 0.81) / 25650 = 2592. We know from our manual calculations that the average payout is associated with about 20 payment_transitions, so Postgres' estimate of the number of distinct payouts is incorrect by two orders of magnitude (25k vs 3.5m). Such an error will prevent Postgres from making sane decisions when comparing plans.
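You can reproduce this arithmetic straight from the catalogs. The query below is a sketch rather than the planner's exact code path; among other things it ignores that n_distinct can be stored as a negative ratio rather than a positive count (here it is the plain count, 25650):

-- reltuples is the planner's row-count estimate for the whole table
SELECT c.reltuples * (1 - s.null_frac) / s.n_distinct AS estimated_rows
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE c.relname = 'payment_transitions'
  AND s.attname = 'payout_id';

With reltuples at roughly 350m this lands close to the 2688 rows printed in our EXPLAIN output; the small difference is because reltuples is itself an estimate rather than exactly 350m.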
Our ideal plan
Our ideal plan would be to fetch all matching payment transitions for our payout, then (knowing this will be a small number) perform an in-memory sort on the results, returning the transition with the minimum id. The initial fetch of matching transitions would be fast thanks to our index_payment_transitions_on_payout_id index.

The plan (with correct row estimations) would look something like this:
QUERY PLAN
Limit (rows=1)
-> Sort (rows=20)
     Sort Key: payment_transitions.id
-> Nested Loop (rows=20)
-> Index Scan using index_payment_transitions_on_payout_id on payment_transitions (rows=20)
Index Cond: (payout_id = 'PO00123456789Z'::text)
-> Index Scan using payments_pkey on payments (rows=1)
Index Cond: ((id)::text = (payment_transitions.payment_id)::text)
Materializing all matching transitions for most payouts will be quick and the subsequent sort cheap, as on average there will be so few of them. This is what we want the query plan to produce but our statistics meant Postgres vastly overestimated the cost of our sort, opting for the much more expensive primary key scan.
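For completeness, there are ways to coax Postgres towards this shape of plan even while the statistics are bad, though they are workarounds rather than the fix we ultimately chose. One sketch is to materialise the matching transitions in a subquery so the planner can no longer lean on the pkey-ordered early exit; OFFSET 0 acts as an optimisation fence here, and on Postgres 12+ a MATERIALIZED CTE would do the same. Check the result with EXPLAIN before relying on it:

SELECT *
FROM (
  -- planned separately, so it is free to use index_payment_transitions_on_payout_id
  SELECT *
  FROM payment_transitions
  WHERE payout_id = 'PO00123456789Z'
  OFFSET 0
) transitions
JOIN payments
ON payments.id = transitions.payment_id
ORDER BY transitions.id ASC
LIMIT 1;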
What went wrong?
Postgres' planner chose a query plan that could potentially require scanning far more data than the alternative, because it believes the chance of an early exit (finding a payment_transition with a matching payout_id early in the scan) is high. We can see in the planner code exactly why this happened and how the decision was made:
LimitPath *
create_limit_path(
PlannerInfo *root, RelOptInfo *rel, Path *subpath,
Node *limitOffset, Node *limitCount,
int64 offset_est, int64 count_est)
{
...
if (count_est != 0)
{
if (subpath->rows > 0)
pathnode->path.total_cost = pathnode->path.startup_cost +
(subpath->total_cost - subpath->startup_cost)
* count_est / subpath->rows;
...
}
}
This code is taken from src/backend/optimizer/util/pathnode.c, which contains utilities to prune and optimise possible query plans (paths). This extract is where we calculate the cost of a plan that has a limit, since a limit makes some optimisations possible.

The most impactful optimisation for limit queries is to exit early. If your query has a very small limit, then Postgres should stop execution as soon as the limit is satisfied rather than continue scanning all your data. When a limit is combined with an ordering, exiting early becomes possible for candidate plans provided they produce rows in an order that matches the sort clause.
Recall that our query asks for a single (LIMIT 1) payment_transitions row sorted by id. If Postgres finds our payment_transitions by searching the payment_transitions_pkey index then we'll find our results in order of their id, meaning the first tuple we return that matches our query's WHERE clause will satisfy our query and we can exit early.
The possibility of early exit means we can discount the total cost of any already-sorted plan by count_est / subpath->rows, as (assuming our matching tuples are evenly distributed across our search space) this is the fraction of the plan we'll need to execute before we've produced all the tuples we require.

In our case, count_est is 1 and subpath->rows is high (2592) due to our underestimation of n_distinct for payout_id. A small numerator and a large denominator means the discount is huge, which is why the nested loop join through the payment_transitions_pkey index was chosen as the best plan.
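We can check this against the numbers EXPLAIN printed earlier: the nested loop subpath had startup cost 1.14, total cost 58045700.29 and an estimated 2675 rows, while count_est is 1. Applying the formula from create_limit_path reproduces the Limit node's cost almost exactly:

-- startup_cost + (total_cost - startup_cost) * count_est / rows
SELECT 1.14 + (58045700.29 - 1.14) * 1 / 2675 AS discounted_total_cost;
-- => ~21700.47, the total cost shown on the Limit node of our plan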
How we fixed it
As soon as we realised our statistics were causing such a poor query plan, we re-ran an analyze to make Postgres resample. We had to do this a few times before our plans got better, which hints at the more concerning root cause of this problem.

When Postgres runs an analyze, it takes a sample of the table to use for generating statistics. There are several subtleties around how Postgres samples the table that can impact the accuracy of the table's statistics, and at present it's not possible to solve all of them.
Sample size
The Postgres GUC (Grand Unified Configuration) variable default_statistics_target defines the default sample size Postgres uses for computing statistics, as well as the number of most common values to track for each column. The default value is 100, which means "take samples of 100 * 300 (a magic number) pages when running an analyze", then sample randomly from amongst the rows included in those pages.
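As a quick concrete check of that arithmetic, you can inspect the setting yourself; a sketch:

SHOW default_statistics_target;  -- 100 unless you've overridden it
-- 100 * 300 = 30,000 rows sampled per analyze, drawn from roughly 30,000 pages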
But how large a sample is large enough? The n-distinct estimator used by Postgres is from IBM Research Report RJ 10025 (Haas and Stokes), where the authors discuss the bias and error that is expected from the estimator given various sample sizes and underlying data characteristics.
In their analysis of the estimator, they note it has been proven by Bunge and Fitzpatrick (Estimating the Number of Species: A Review) that unbiased estimators do not exist when the sample size is smaller than the count of the most frequently occurring value in the population. The bias in these estimators is significant (anywhere up to 80%) and small sample sizes will cause the bias to increase.
The estimator's bias is always negative, meaning we estimate fewer distinct values than are actually present, which could explain the underestimation behind our query plan malfunction. There can be anywhere up to 100k payment_transitions with the same payout_id value, so at minimum we should sample that many transition rows to guarantee we'll find more than one distinct payout_id value. As ~80% of payout_id values are NULL, we require 100k / (1 - 0.8) = 500k rows, or 1666 as a statistics target (recalling that Postgres multiplies your statistics target by 300 to find the desired number of samples).
We bumped the statistics target for this column like so:
ALTER TABLE payment_transitions
ALTER COLUMN payout_id SET STATISTICS 1666;
We then repeatedly ran analyzes, checking the n_distinct value after each run. While the values were slightly better than what we'd seen before, we continued to see large variation and massive underestimation. It wasn't until we bumped our target to 5000 that the value became stable.
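A sketch of the check loop we ran after each bump. The catalog column attstattarget holds the per-column override and is -1 when the column falls back to default_statistics_target:

-- confirm the per-column override took effect
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'payment_transitions'::regclass
  AND attname = 'payout_id';

-- resample, then see what the estimator now believes
ANALYZE payment_transitions;

SELECT n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'payment_transitions'
  AND attname = 'payout_id';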
Not all samples are created equal
We expected that a moderate bump of our sample size (say, to 500) would produce markedly better results than the default 100, but this turned out not to be the case. Careful thought about how Postgres generates its random sample led to the conclusion that we were unduly biasing our estimator by taking a fair, random sample from a statistically biased selection of pages.

Postgres generates its samples in a two-stage process: to collect a sample of 100k rows, it first gathers 100k pages and then collects the sample from those pages. This means not every table tuple has the same probability of appearing in our sample, as we're confined to the pages selected in the first pass. Ideally this wouldn't be a problem, assuming column values are distributed independently amongst pages, but in our case (and we suspect many others) this is not true.
See the comment in Postgres' analyze.c (https://github.com/postgres/postgres/blob/REL_11_2/src/backend/commands/analyze.c#L1029-L1033) for an explanation of this two-phase sampling strategy; the comment also mentions some flaws of the approach.
Our system creates all payment_transitions for the same payout_id in one sweep. The payment_transitions table is mostly append-only, so Postgres is prone to place all those new transitions physically adjacent to one another, sharing the same pages. If we take our random sample from a restricted set of pages, we've vastly increased the probability of sampling a value multiple times in comparison to selecting from the entire table.
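You can see this clustering directly via ctid, which exposes each row's physical (page, tuple) address; the cast chain below is a common trick for extracting the page number. This is a diagnostic sketch, not something to run in a hot path:

SELECT count(*) AS transitions,
       count(DISTINCT (ctid::text::point)[0]) AS distinct_pages
FROM payment_transitions
WHERE payout_id = 'PO00123456789Z';
-- ~20 transitions squeezed onto a handful of pages is exactly the
-- physical adjacency described above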
We can confirm this bias by using Postgres table samples to compute our statistics with the system strategy (which approximates our analyze process) versus statistically fair sampling with bernoulli. A statistics target of 500 means we'd sample 150k (300 * 500) rows, which is about 0.043% of our table. Using this sample size and comparing the two sampling methods, we see a stark difference:
postgres=# select count(*) as sampled,
                  count(distinct(payout_id)) as unique
             from payment_transitions tablesample system(0.043);
 sampled | unique
---------+--------
  153667 |  11029

postgres=# select count(*) as sampled,
                  count(distinct(payout_id)) as unique
             from payment_transitions tablesample bernoulli(0.043);
 sampled | unique
---------+--------
  153667 |  25351
Sample everything?
Everything we've explained so far might have you asking why you would ever not use the maximum sample size. If our sample is the whole table then we can't have statistical bias, right?
Higher statistics targets increase the amount of time required to perform an analyze on a table. The Postgres auto-vacuum daemon is constantly triggering ANALYZEs in response to database activity, and the SHARE UPDATE EXCLUSIVE locks they take on their tables can block incoming ALTER TABLE commands. This can be a problem if you frequently want to change table structure, or don't take care to time out the lock to avoid blocking other incoming queries.
Setting lock timeouts is a good way to avoid causing disruption when taking locks in Postgres.
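A minimal sketch of such a guard, reusing the statistics change from earlier (the timeout value here is illustrative, pick one that suits your workload):

BEGIN;
SET LOCAL lock_timeout = '1s';  -- give up rather than queue behind an in-flight ANALYZE
ALTER TABLE payment_transitions
ALTER COLUMN payout_id SET STATISTICS 1666;
COMMIT;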
One non-obvious consequence is how this affects upgrading your database. Postgres major version upgrades reset all statistics values. If your database depends on good statistics to perform (hint: it probably does!) then a full ANALYZE will be required before you can accept production traffic. The longer your ANALYZE takes, the longer you'll be down.
For this and other reasons you may want to hold off from dialling the sample size up to the maximum. With the exception of the n-distinct estimator we've covered in this post, our experience has shown Postgres to make very good decisions with normal sample sizes. Data shapes more varied than our payment_transitions table may be better suited to restructuring than to hacking the statistics engine. Restructuring is particularly worthwhile for large, heterogeneous datasets: when all the rows look similar there will be fewer candidate plans and less chance of a disastrous mis-step.
Conclusions
Postgres has a fantastic query planner that can help scale a database far beyond the size of the average organisation. It also provides great tools and documentation that can help you deal with most performance issues that arise with growth, and the flexibility via configuration values to handle most use cases.
That said, the heuristics powering the query planner can cause Postgres to make decisions that flip performance on its head. In this production issue we saw a normally fast query degrade in a spectacular fashion, and it was only after peeling back a few layers that we began to understand why it happened.
We've covered some ways you can configure Postgres to adapt beyond the size of the average database, and explained some of the trade-offs involved. We chose to increase our statistics target to provide better information to the query planner, but we're also pushing for more careful thought around data access patterns to reduce the complexity of each query, making it less likely for the planner to make a mistake.
Hopefully this is a useful case-study for those who want to learn more about the query planner. If you have feedback or questions about this article we'd love to hear from you @GoCardlessEng.