Postgres Plan Flip Diagnosis
A query that ran in 12 ms yesterday takes 4.6 seconds today. The schema is unchanged. The data volume looks normal. The application code has not been deployed. What changed is the plan: the Postgres planner picked a different join order or scan type, and that choice is now wrong. Plan flips are among the hardest performance regressions to track down, precisely because the artifact that changed lives inside Postgres rather than in a Git commit. This guide is the diagnostic loop: detect the flip, isolate the trigger, validate the fix.
What causes a plan flip
The planner picks a plan by estimating cost. Cost is a function of three inputs: schema (indexes, constraints, partitioning), statistics (column histograms, distinct counts, correlation), and parameters (the actual values bound to a prepared statement). Change any one and the chosen plan can shift. The five most common triggers, ordered by frequency:
- Statistics refresh. Autovacuum or a manual ANALYZE updates a histogram, the new histogram changes a selectivity estimate, and the planner switches join order. Most plan flips are this.
- Parameter sniffing. A prepared statement built its generic plan around a frequent parameter value. When called with an outlier, the same plan does the wrong thing.
- Data-volume crossing. A table grew past the point where the planner's seq-scan vs index-scan estimate tips. Often hits 1–6 months after launch.
- Catalog change. A new index was added; the planner now sees a cheaper access path and chooses it. The new path may not actually be cheaper in practice: it may cover different columns or sit cold in cache.
- Postgres version upgrade. Cost-model tweaks between major versions change which plan wins by a hair. Always re-baseline EXPLAIN output after a major upgrade.
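The data-volume-crossing trigger above can be checked directly by comparing the planner's cost estimates for both access paths. A minimal sketch, assuming a hypothetical orders table with an index on customer_id:

```sql
-- The planner's current pick, with its estimated total cost:
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- Suppress index paths to see the seq-scan cost it rejected:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
RESET enable_indexscan;
RESET enable_bitmapscan;
```

If the two total costs are within a few percent of each other, the table is near the tipping point and modest growth is enough to flip the plan.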
Detect: capture EXPLAIN over time
You cannot diagnose what you have not recorded. Three tools, in increasing capture fidelity:
pg_stat_statements: structural change as a signal
pg_stat_statements does not store plans, but plan flips produce a sudden change in the ratio of shared_blks_read to calls — a query that switched from index to seq scan reads orders of magnitude more pages per call. Snapshot every 5 minutes and alert when the ratio doubles:
WITH base AS (
SELECT
queryid,
sum(calls) AS calls,
sum(shared_blks_read) AS reads,
sum(shared_blks_hit) AS hits
FROM pg_stat_statements
GROUP BY queryid
)
SELECT
queryid,
calls,
reads,
hits,
round((reads::numeric / nullif(calls, 0)), 2) AS reads_per_call
FROM base
WHERE calls > 100
ORDER BY reads_per_call DESC
LIMIT 20;

auto_explain: capture plans for slow queries
auto_explain is the workhorse. Enable it in postgresql.conf with a duration threshold:
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = 200ms
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_timing = on
auto_explain.log_nested_statements = on
auto_explain.sample_rate = 0.1   # 10% of slow queries

Every query above 200 ms (sampled at 10%) now gets its full EXPLAIN ANALYZE logged. Ship the logs to a log store and you have a searchable history of plan choices per fingerprint. When a query gets slow, compare the most recent entry for that fingerprint with the most recent entry from before it got slow: the diff between the two is the plan flip.
pg_store_plans / TuneStack PlanFlip
The third-party pg_store_plans extension stores plan hashes per pg_stat_statements queryid. When the hash for a queryid changes, you have a plan flip, detected directly. TuneStack's PlanFlip detector does the same with an in-house plan fingerprint plus diff visualization, so the operator does not have to read raw EXPLAIN trees side by side.
Isolate: what triggered the flip
Once you have the before/after plans, work top-down:
1. Diff the row estimates
Look at the (rows=...) annotations on every node. If an estimate deep in the tree changed by more than a factor of 10, the trigger was likely a statistics refresh. Verify by checking when last_analyze or last_autoanalyze last ran:
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

If last_autoanalyze aligns with when the query went slow, the trigger is statistics. The fix is usually either extended statistics (CREATE STATISTICS for correlated columns) or a more frequent analyze cadence on that table.
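Where the misestimate comes from correlated columns, extended statistics are the fix. A hedged sketch, with city and zip as hypothetical correlated columns on orders:

```sql
-- Hypothetical: orders.city and orders.zip are correlated, so the
-- planner's independent per-column selectivities multiply into a row
-- estimate that is far too low.
CREATE STATISTICS orders_city_zip (dependencies, ndistinct)
  ON city, zip FROM orders;
ANALYZE orders;  -- extended statistics only take effect after analyze
```

After the ANALYZE, re-run EXPLAIN on the slow query and check whether the row estimate on the filtered node moved toward the actual count.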
2. Check for parameter sniffing
If the query is a prepared statement, Postgres may switch to a cached generic plan after the first five invocations, and that generic plan reflects the parameter values seen early on. Force a custom plan per call to compare:
SET plan_cache_mode = 'force_custom_plan';
-- inside the application connection
PREPARE q (int) AS
SELECT * FROM orders WHERE customer_id = $1;
EXPLAIN ANALYZE EXECUTE q (12345); -- frequent value
EXPLAIN ANALYZE EXECUTE q (999999); -- outlier

If the outlier plan is good but the cached generic plan is bad, parameter sniffing is the trigger. Solutions, in order of invasiveness: cycle the prepared statement after deploy, switch the cursor to literal SQL for the outlier path, or use plan_cache_mode = 'force_custom_plan' for the affected statement.
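The least invasive remedy can be scoped to a single session. A sketch, reusing the prepared statement q from above (Postgres 12+ for plan_cache_mode):

```sql
-- Force fresh planning for each EXECUTE in this session only.
SET plan_cache_mode = force_custom_plan;
EXECUTE q (999999);      -- now planned for the outlier value
RESET plan_cache_mode;   -- back to the default 'auto' heuristic
DEALLOCATE q;            -- or cycle the statement entirely after deploy
```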
3. Look for catalog change
Postgres does not record an index creation timestamp directly, but pg_stat_user_indexes and pg_index show which indexes exist on the relation and how heavily the planner uses them. A recent indisready = true, indisvalid = true transition on the same relation is a candidate trigger:
SELECT
c.relname AS index,
i.indrelid::regclass AS table,
pg_size_pretty(pg_relation_size(c.oid)) AS size,
s.idx_scan,
s.idx_tup_read
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
LEFT JOIN pg_stat_user_indexes s ON s.indexrelid = c.oid
WHERE i.indrelid::regclass::text = 'orders'
ORDER BY c.relname;

An index that the planner is using (high idx_scan) but that was created the same day the query got slow is a primary suspect. Hide it from the planner with HypoPG in a session and re-run the query; if the plan reverts, you have confirmed the trigger.
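The confirmation step can be sketched as follows, assuming HypoPG 1.4+ (which added index hiding) and a suspect index named orders_customer_idx:

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;
-- Hide the suspect index from the planner for this session only;
-- the index itself is untouched on disk.
SELECT hypopg_hide_index('orders_customer_idx'::regclass);
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
-- If the plan reverts to the fast shape, the new index was the trigger.
SELECT hypopg_unhide_all_indexes();
```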
Validate: prove the fix without merging it
Never push an ANALYZE-frequency change or a new index blindly. Validate first:
- For statistics fixes: run ANALYZE on the suspect table in a transaction with SET LOCAL default_statistics_target = 500; and re-run the slow query. Better plan? You have your fix.
- For new index proposals: simulate with HypoPG:

CREATE EXTENSION IF NOT EXISTS hypopg;
SELECT hypopg_create_index(
  'CREATE INDEX ON orders (customer_id, created_at DESC)'
);
EXPLAIN (FORMAT JSON, ANALYZE OFF)
SELECT * FROM orders
WHERE customer_id = 12345  -- bind a concrete value; bare $1 fails outside a prepared statement
ORDER BY created_at DESC
LIMIT 50;
SELECT hypopg_reset();

The plan in the EXPLAIN output is what the real planner would pick if the index existed. Cost-compare against the current plan before you actually CREATE INDEX CONCURRENTLY in production.
- For prepared-statement fixes: deploy the code change to a single canary replica via session variables before promoting to the primary.
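The in-transaction statistics trial works because ANALYZE writes its pg_statistic rows transactionally, so a rollback discards them. A sketch, using the orders lookup from earlier:

```sql
BEGIN;
SET LOCAL default_statistics_target = 500;
ANALYZE orders;   -- rebuilds the table's statistics at the higher target
-- Re-run the slow query here and compare the plan:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;
ROLLBACK;         -- discards the trial statistics
```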
Lock the plan if you have to
Postgres has no first-class plan hints. The community consensus is that the planner should be fixed, not bypassed. But when the planner refuses to pick the right plan and you have a deadline, you have three escape hatches:
- Disable a node type per session: SET LOCAL enable_seqscan = off; in the transaction that runs the slow query. Hard hammer, but it works.
- The pg_hint_plan extension: install it on the cluster and embed hints as comments (/*+ IndexScan(orders orders_customer_idx) */). Maintain the hint comments in code alongside the SQL.
- Rewrite the query. CTE boundaries are optimization fences in Postgres before 12, which constrains the planner's join-ordering choices. On Postgres 12+, WITH a AS MATERIALIZED (...) is the explicit modern way to force the executor to materialize an intermediate result.
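The MATERIALIZED fence from the last bullet, as a concrete sketch with hypothetical column names on the orders table:

```sql
-- Force materialization of the intermediate result on Postgres 12+,
-- preventing the planner from inlining the CTE into the outer query.
WITH recent AS MATERIALIZED (
  SELECT * FROM orders
  WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE customer_id = 12345;
```

Without MATERIALIZED, Postgres 12+ would inline the CTE and plan the whole query as one join tree, which is usually what you want; the fence is the exception for when the planner's choice is wrong.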
Each escape hatch is technical debt. Tag the hint or session override in code with a date and a link to the upstream planner ticket if one exists, so you can revisit when the underlying root cause has a real fix.
Quick checklist
- Snapshot pg_stat_statements every 5 minutes; alert on a 2× jump in reads_per_call.
- Run auto_explain at a 10% sample and a 200 ms threshold: cheap, and it covers the long tail.
- When a query flips, diff the before/after row estimates first. Most flips are statistics.
- For prepared statements, suspect parameter sniffing; test with plan_cache_mode = 'force_custom_plan'.
- Validate index proposals with hypopg before CREATE INDEX CONCURRENTLY.
- Plan locking is debt. Use it only while the underlying planner issue remains unfixed and your timeline is shorter than the next major release.
Related: Reading EXPLAIN plans · Query regression detection · pg_stat_statements