Postgres Plan Flip Diagnosis
A query that ran in 12 ms yesterday takes 4.6 seconds today. The schema is unchanged. The data volume looks normal. The application code has not been deployed. What changed is the plan: the Postgres planner picked a different join order or scan type, and that choice is now wrong. Plan flips are among the hardest performance regressions to track down, precisely because the artifact that changed lives inside Postgres rather than in a Git commit. This guide is the diagnostic loop: detect the flip, isolate the trigger, validate the fix.
What causes a plan flip
The planner picks a plan by estimating cost. Cost is a function of three inputs: schema (indexes, constraints, partitioning), statistics (column histograms, distinct counts, correlation), and parameters (the actual values bound to a prepared statement). Change any one and the chosen plan can shift. The five most common triggers, ordered by frequency:
- Statistics refresh. Autovacuum or a manual ANALYZE updates a histogram, the new histogram changes a selectivity estimate, and the planner switches join order. Most plan flips are this.
- Parameter sniffing. A prepared statement built its generic plan around a frequent parameter value. When called with an outlier, the same plan does the wrong thing.
- Data-volume crossing. A table grew past the point where the planner's seq-scan vs index-scan estimate tips. Often hits 1–6 months after launch.
- Catalog change. A new index was added; the planner now sees a cheaper access path and chooses it. The new path may not actually be cheaper in practice: it may cover different columns or sit cold in cache.
- Postgres version upgrade. Cost-model tweaks between major versions change which plan wins by a hair. Always re-baseline EXPLAIN output after a major upgrade.
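The data-volume-crossing trigger above can be checked directly by comparing the planner's cost estimates for both access paths. A minimal sketch, assuming a hypothetical orders table with an index on customer_id:

```sql
-- The planner's current pick, with its estimated total cost:
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- Suppress index paths to see the seq-scan cost it rejected:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
RESET enable_indexscan;
RESET enable_bitmapscan;
```

If the two total costs are within a few percent of each other, the table is near the tipping point and modest growth is enough to flip the plan.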
Detect: capture EXPLAIN over time
You cannot diagnose what you have not recorded. Three tools, in increasing capture fidelity:
pg_stat_statements: structural change as a signal
pg_stat_statements does not store plans, but plan flips produce a sudden change in the ratio of shared_blks_read to calls — a query that switched from index to seq scan reads orders of magnitude more pages per call. Snapshot every 5 minutes and alert when the ratio doubles:
WITH base AS (
SELECT
queryid,
sum(calls) AS calls,
sum(shared_blks_read) AS reads,
sum(shared_blks_hit) AS hits
FROM pg_stat_statements
GROUP BY queryid
)
SELECT
queryid,
calls,
reads,
hits,
round((reads::numeric / nullif(calls, 0)), 2) AS reads_per_call
FROM base
WHERE calls > 100
ORDER BY reads_per_call DESC
LIMIT 20;

auto_explain: capture plans for slow queries
auto_explain is the workhorse. Enable it in postgresql.conf with a duration threshold:
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = 200ms
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_timing = on
auto_explain.log_nested_statements = on
auto_explain.sample_rate = 0.1   # 10% of slow queries

Every query above 200 ms (sampled at 10%) now gets its full EXPLAIN ANALYZE logged. Ship the logs to a log store and you have a searchable history of plan choices per fingerprint. When a query gets slow, compare the most recent entry for that fingerprint with the most recent entry from before it got slow: the diff between the two is the plan flip.
pg_store_plans / TuneStack PlanFlip
The third-party pg_store_plans extension stores plan hashes per pg_stat_statements queryid. When the hash for a queryid changes, you have a plan flip, detected directly. TuneStack's PlanFlip detector does the same with an in-house plan fingerprint plus diff visualization, so the operator does not have to read raw EXPLAIN trees side by side.
Isolate: what triggered the flip
Once you have the before/after plans, work top-down:
1. Diff the row estimates
Look at the (rows=...) annotations on every node. If an estimate deep in the tree changed by more than a factor of 10, the trigger was likely a statistics refresh. Verify by checking when last_analyze or last_autoanalyze last ran:
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

If last_autoanalyze aligns with when the query went slow, the trigger is statistics. The fix is usually either extended statistics (CREATE STATISTICS for correlated columns) or a more frequent analyze cadence on that table.
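Where the misestimate comes from correlated columns, extended statistics are the fix. A hedged sketch, with city and zip as hypothetical correlated columns on orders:

```sql
-- Hypothetical: orders.city and orders.zip are correlated, so the
-- planner's independent per-column selectivities multiply into a row
-- estimate that is far too low.
CREATE STATISTICS orders_city_zip (dependencies, ndistinct)
  ON city, zip FROM orders;
ANALYZE orders;  -- extended statistics only take effect after analyze
```

After the ANALYZE, re-run EXPLAIN on the slow query and check whether the row estimate on the filtered node moved toward the actual count.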
2. Check for parameter sniffing
If the query is a prepared statement, Postgres may switch to a cached generic plan after the first five invocations, and that generic plan reflects the parameter values seen early on. Force a custom plan per call to compare:
SET plan_cache_mode = 'force_custom_plan';
-- inside the application connection
PREPARE q (int) AS
SELECT * FROM orders WHERE customer_id = $1;
EXPLAIN ANALYZE EXECUTE q (12345); -- frequent value
EXPLAIN ANALYZE EXECUTE q (999999); -- outlier

If the outlier plan is good but the cached generic plan is bad, parameter sniffing is the trigger. Solutions, in order of invasiveness: cycle the prepared statement after deploy, switch the cursor to literal SQL for the outlier path, or use plan_cache_mode = 'force_custom_plan' for the affected statement.
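The least invasive remedy can be scoped to a single session. A sketch, reusing the prepared statement q from above (Postgres 12+ for plan_cache_mode):

```sql
-- Force fresh planning for each EXECUTE in this session only.
SET plan_cache_mode = force_custom_plan;
EXECUTE q (999999);      -- now planned for the outlier value
RESET plan_cache_mode;   -- back to the default 'auto' heuristic
DEALLOCATE q;            -- or cycle the statement entirely after deploy
```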
3. Look for catalog change
Postgres does not record an index creation timestamp directly, but pg_stat_user_indexes and pg_index show which indexes exist on the relation and how heavily the planner uses them. A recent indisready = true, indisvalid = true transition on the same relation is a candidate trigger:
SELECT
c.relname AS index,
i.indrelid::regclass AS table,
pg_size_pretty(pg_relation_size(c.oid)) AS size,
s.idx_scan,
s.idx_tup_read
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
LEFT JOIN pg_stat_user_indexes s ON s.indexrelid = c.oid
WHERE i.indrelid::regclass::text = 'orders'
ORDER BY c.relname;

An index that the planner is using (high idx_scan) but that was created the same day the query got slow is a primary suspect. Hide it from the planner with HypoPG in a session and re-run the query; if the plan reverts, you have confirmed the trigger.
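The confirmation step can be sketched as follows, assuming HypoPG 1.4+ (which added index hiding) and a suspect index named orders_customer_idx:

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;
-- Hide the suspect index from the planner for this session only;
-- the index itself is untouched on disk.
SELECT hypopg_hide_index('orders_customer_idx'::regclass);
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
-- If the plan reverts to the fast shape, the new index was the trigger.
SELECT hypopg_unhide_all_indexes();
```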
Validate: prove the fix without merging it
Never push an ANALYZE-frequency change or a new index blindly. Validate first:
- For statistics fixes: run ANALYZE on the suspect table in a transaction with SET LOCAL default_statistics_target = 500; and re-run the slow query. Better plan? You have your fix.
- For new index proposals: simulate with HypoPG:

CREATE EXTENSION IF NOT EXISTS hypopg;
SELECT hypopg_create_index(
  'CREATE INDEX ON orders (customer_id, created_at DESC)'
);
EXPLAIN (FORMAT JSON, ANALYZE OFF)
SELECT * FROM orders
WHERE customer_id = 12345  -- bind a concrete value; bare $1 fails outside a prepared statement
ORDER BY created_at DESC
LIMIT 50;
SELECT hypopg_reset();

The plan in the EXPLAIN output is what the real planner would pick if the index existed. Cost-compare against the current plan before you actually CREATE INDEX CONCURRENTLY in production.
- For prepared-statement fixes: deploy the code change to a single canary replica via session variables before promoting to the primary.
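The in-transaction statistics trial works because ANALYZE writes its pg_statistic rows transactionally, so a rollback discards them. A sketch, using the orders lookup from earlier:

```sql
BEGIN;
SET LOCAL default_statistics_target = 500;
ANALYZE orders;   -- rebuilds the table's statistics at the higher target
-- Re-run the slow query here and compare the plan:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;
ROLLBACK;         -- discards the trial statistics
```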
Lock the plan if you have to
Postgres has no first-class plan hints. The community consensus is that the planner should be fixed, not bypassed. But when the planner refuses to pick the right plan and you have a deadline, you have three escape hatches:
- Disable a node type per session: SET LOCAL enable_seqscan = off; in the transaction that runs the slow query. Hard hammer, but it works.
- The pg_hint_plan extension: install it on the cluster and embed hints as comments (/*+ IndexScan(orders orders_customer_idx) */). Maintain the hint comments in code alongside the SQL.
- Rewrite the query. CTE boundaries are optimization fences in Postgres before 12, which constrains the planner's join-ordering choices. On Postgres 12+, WITH a AS MATERIALIZED (...) is the explicit modern way to force the executor to materialize an intermediate result.
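The MATERIALIZED fence from the last bullet, as a concrete sketch with hypothetical column names on the orders table:

```sql
-- Force materialization of the intermediate result on Postgres 12+,
-- preventing the planner from inlining the CTE into the outer query.
WITH recent AS MATERIALIZED (
  SELECT * FROM orders
  WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE customer_id = 12345;
```

Without MATERIALIZED, Postgres 12+ would inline the CTE and plan the whole query as one join tree, which is usually what you want; the fence is the exception for when the planner's choice is wrong.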
Each escape hatch is technical debt. Tag the hint or session override in code with a date and a link to the upstream planner ticket if one exists, so you can revisit when the underlying root cause has a real fix.
Quick checklist
- Snapshot pg_stat_statements every 5 minutes; alert on a 2× jump in reads_per_call.
- Run auto_explain at a 10% sample and a 200 ms threshold: cheap, and it covers the long tail.
- When a query flips, diff the before/after row estimates first. Most flips are statistics.
- For prepared statements, suspect parameter sniffing; test with plan_cache_mode = 'force_custom_plan'.
- Validate index proposals with hypopg before CREATE INDEX CONCURRENTLY.
- Plan locking is debt. Use it only while the underlying planner issue remains unfixed and your timeline is shorter than the next major release.
Related: Reading EXPLAIN plans · Query regression detection · pg_stat_statements