Skip to content

Postgres MVCC Internals

Multi-Version Concurrency Control (MVCC) is the mechanism that lets Postgres readers and writers proceed without blocking each other. Instead of locking rows, Postgres keeps multiple versions of each row and lets each transaction see a consistent snapshot of the database. Understanding how this works explains why vacuum exists, where table bloat comes from, and why long-running transactions are dangerous.

Tuple visibility headers: xmin and xmax

Every row in Postgres — called a tuple — carries two hidden system columns in its header:

  • xmin — the transaction ID (XID) of the transaction that inserted this tuple version. The tuple is visible to transactions that started after xmin committed.
  • xmax — the XID of the transaction that deleted or superseded this tuple version. Zero means the tuple has not been deleted. A non-zero xmax that has committed means the tuple is dead.

You can inspect these directly:

SELECT xmin, xmax, id, email
FROM users
WHERE id = 42;

An UPDATE does not modify a row in place. It writes a new tuple with a new xmin and sets xmax on the old tuple to the updating transaction's XID. Both versions coexist on disk until vacuum reclaims the old one.

Snapshot isolation

When a transaction starts (or, in READ COMMITTED mode, when each statement starts), Postgres captures a snapshot: the current XID counter and a list of XIDs that are currently in-progress. The snapshot defines visibility:

  • A tuple is visible if xmin committed before the snapshot was taken and either xmax is zero or xmax had not yet committed when the snapshot was taken.
  • A tuple inserted by a transaction that was in-progress when the snapshot was captured is invisible — even if that transaction has since committed.

This means two concurrent transactions reading the same row see different versions without any locking. The reader never blocks the writer, and the writer never blocks the reader.

The two isolation levels Postgres supports:

  • READ COMMITTED (default): snapshot taken per statement. A long query will see rows committed by other transactions mid-execution. Safe for most OLTP work.
  • REPEATABLE READ / SERIALIZABLE: snapshot taken once at transaction start. All statements in the transaction see the same consistent state. Required for analytics queries that span multiple statements and for financial correctness.

Why updates are expensive (heap-only tuples)

The write amplification from MVCC is real. A single UPDATE writes a full new tuple, updates every index that covers the changed column, and marks the old tuple dead. On a table with 10 indexes, updating one column touches 11 heap or index pages plus the WAL record for each.

Postgres partially mitigates this with Heap-Only Tuples (HOT): if the new tuple version fits on the same heap page as the old one and no indexed column changed, Postgres skips index updates entirely and chains the old and new tuples via an in-page pointer. HOT updates are significantly cheaper. Monitor HOT effectiveness:

SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;

A low hot_pct on a write-heavy table usually means one of: not enough page free space (tune fillfactor), or too many indexes covering volatile columns.

Dead tuples and vacuum's role

Dead tuples — old versions with a committed xmax — sit in the heap indefinitely until vacuum removes them. Until then, every sequential scan must skip over them, and indexes retain entries pointing to them.

Vacuum does two passes:

  • Heap pass: scans pages, identifies dead tuples, marks the space as reusable. New inserts can reclaim this space on the same page without extending the file.
  • Index pass: for each index, removes entries pointing to dead tuples. This is the expensive part — it requires touching every index, not just the ones for changed columns.

Vacuum does not return space to the OS by default. VACUUM FULL does, but it requires an exclusive lock and rewrites the entire table. For routine maintenance, regular vacuum reclaiming space in-place is the correct approach.

Transaction ID wraparound: the existential risk

Postgres uses 32-bit transaction IDs, cycling through approximately 2.1 billion values. Each XID is compared to the current XID to determine "past" vs "future." When a table has unfrozen tuples with very old XIDs, Postgres can no longer determine their visibility — they would appear to be from the future, making all data in the table invisible. This is an unrecoverable state.

Vacuum's freeze pass addresses this: it replaces old xmin values with a special FrozenTransactionId marker, meaning "this tuple is permanently visible to all transactions." Autovacuum triggers a freeze pass automatically as tables approach the autovacuum_freeze_max_age threshold (default: 200 million transactions from current XID).

Check XID age across all tables:

SELECT relname,
       age(relfrozenxid)          AS xid_age,
       pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;

Alert threshold: any table above 150 million transactions. At 1.5 billion, Postgres forces an aggressive vacuum that can spike I/O. At 2 billion, the database refuses new connections and shuts down.

Long transactions block vacuum

Vacuum cannot remove a dead tuple if any active transaction's snapshot might still need it. The oldest active transaction's XID defines the xmin horizon — vacuum cannot advance past it. A single long-running transaction (a forgotten BEGIN in a psql session, a stuck background job) pins the xmin horizon and causes dead tuples to pile up across the entire database, not just the tables the long transaction is using.

Find long-running transactions:

SELECT pid,
       now() - xact_start          AS txn_age,
       now() - state_change         AS state_age,
       state,
       left(query, 80)              AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY txn_age DESC;

Set idle_in_transaction_session_timeout to automatically terminate sessions that hold open transactions without executing queries:

SET idle_in_transaction_session_timeout = '5min';

Quick checklist

  • Inspect xmin/xmax on suspicious rows to understand what vacuum has and hasn't cleaned up.
  • Monitor hot_pct — low values on write-heavy tables indicate fillfactor or index over-coverage issues.
  • Alert on XID age above 150 million; never let it reach 1.5 billion.
  • Set idle_in_transaction_session_timeout to prevent long transactions from pinning the xmin horizon.
  • Use REPEATABLE READ for multi-statement analytics; use READ COMMITTED for short OLTP transactions.

Related: Autovacuum Tuning · Postgres Index Bloat

Postgres MVCC Internals | TuneStack