๐ŸŽ New User? Get 20% off your first purchase with code NEWUSER20 ยท โšก Instant download ยท ๐Ÿ”’ Secure checkout Register Now โ†’
Menu

Categories

PostgreSQL 17 Logical Replication: A Production-Ready Setup Walkthrough


Quick summary: PostgreSQL 17 makes logical replication finally usable for production migrations, blue-green deployments, and selective table replication across versions. Failover slots eliminate the "subscribers break on failover" trap that haunted 16.x. The bidirectional pieces are scaffolding for future BDR but useful today for active-active read scaling. This guide walks through a complete production setup: configure publications, build subscribers, monitor lag, handle conflicts, and execute a planned switchover without losing data.


Logical vs Physical Replication: When Each Wins

Physical streaming replication has been Postgres's default high-availability tool for over a decade. It is fast, reliable, and operationally simple; the standby is a byte-for-byte copy of the primary. But it has hard constraints: you cannot replicate selectively (it is the entire cluster or nothing), you cannot replicate across major versions, and the standby is read-only with no schema flexibility.

Logical replication solves all of those. It replicates row-level changes via the publication/subscription model, supports cross-version replication (16 → 17 in either direction), allows selective table replication, lets the subscriber accept writes, and supports schema differences between publisher and subscriber. The trade-offs: more operational complexity, no DDL replication out of the box (improving in 17 but still incomplete), and conflict handling becomes the application's problem.

The practical decision tree in 2026:

  • HA with automatic failover โ†’ physical streaming + Patroni or repmgr
  • Read scaling with full data โ†’ physical streaming + multiple read replicas
  • Major version upgrade with minimal downtime โ†’ logical replication
  • Cross-cluster data sharing โ†’ logical replication
  • Multi-region active-active reads โ†’ logical replication, often bidirectional
  • Heterogeneous topologies โ†’ logical replication

What Is New in PostgreSQL 17

Failover slots

The headline feature. Before 17, if your publisher had a physical streaming replica and you failed over to it, all your logical replication slots were gone; subscribers had to be re-initialized from a fresh snapshot. In 17, replication slots can be marked as "failover" slots and are synchronized to physical replicas. After failover, subscribers reconnect to the new primary and continue from where they left off.

This is the change that makes logical replication operationally credible for HA environments. Before 17, "I have logical replication AND a hot standby" required a complicated runbook. In 17, it is a configuration flag.
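
A quick way to verify the setup is to inspect the new `failover` and `synced` columns that 17 adds to pg_replication_slots (run the second query on the physical standby):

```sql
-- On the publisher: which logical slots are marked to survive failover
SELECT slot_name, slot_type, failover
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- On the physical standby: synced = true means the slot has been
-- synchronized and will be usable after promotion
SELECT slot_name, failover, synced
FROM pg_replication_slots;
```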

Streaming of large transactions during decoding

Streaming of in-progress transactions was introduced in 14, and 16 added parallel apply on the subscriber; 17 continues to refine both. Without streaming, large transactions are decoded entirely in memory (or spilled to disk) before being sent to subscribers, and a multi-gigabyte UPDATE could exhaust resources on the publisher. With incremental streaming, memory pressure drops dramatically.

Improved DDL handling

17 still does not auto-replicate DDL (that is the headline missing feature, and 18 is rumored to deliver it), but it now provides better hooks for tools like pglogical to capture DDL events. The combination of pglogical + Postgres 17 finally gives you reliable schema-evolution-aware replication.

Logical decoding from standby

Creating logical replication slots on physical standbys arrived in 16, and 17 builds on it with slot synchronization. This means you can offload the decoding work from the primary, reducing CPU pressure on hot writers. For very high-throughput publishers, this can be the difference between "logical replication is too expensive" and "logical replication is fine."
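
As a sketch, creating a decoding slot on a standby looks the same as on a primary; the slot name below is illustrative, and the standby should have hot_standby_feedback = on for this to work reliably:

```sql
-- Run on the physical standby, not the primary
SELECT pg_create_logical_replication_slot('standby_decode', 'pgoutput');
```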

The Production Setup: Publisher

postgresql.conf

# Required for logical replication
wal_level = logical

# Sized for production logical replication
max_wal_senders = 16          # raise if you have many subscribers + physical replicas
max_replication_slots = 16    # one per active logical subscription + headroom
max_logical_replication_workers = 8
max_worker_processes = 16     # must accommodate above

# Failover slot synchronization (PG17) -- enable on the publisher's
# physical standby; the standby also needs hot_standby_feedback = on,
# a primary_slot_name, and dbname in primary_conninfo
sync_replication_slots = on

# Bandwidth/throughput tuning for busy publishers
wal_sender_timeout = 60s
logical_decoding_work_mem = 256MB

The sync_replication_slots setting is the new 17-specific knob; note that it takes effect on the publisher's physical standby, not the publisher itself. Combined with failover = true on the subscription (which marks the slot as a failover slot), this is what makes failover-survivable subscriptions possible.

pg_hba.conf

host    replication    repuser    10.0.0.0/8     scram-sha-256
host    appdb          repuser    10.0.0.0/8     scram-sha-256

Logical replication needs both replication-protocol access and regular database access; the subscriber connects as a regular user that happens to have the REPLICATION privilege.

Create the role and publication

-- On publisher
CREATE ROLE repuser WITH REPLICATION LOGIN PASSWORD 'strong-password';
GRANT CONNECT ON DATABASE appdb TO repuser;
\c appdb
GRANT USAGE ON SCHEMA public TO repuser;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repuser;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO repuser;

-- Publish all tables (simple case)
CREATE PUBLICATION app_all FOR ALL TABLES;

-- Or publish specific tables (more common for selective replication)
CREATE PUBLICATION app_orders FOR TABLE orders, order_items, customers;

-- Or with row filters (selective replication)
CREATE PUBLICATION app_active FOR TABLE orders WHERE (status <> 'archived');

The Production Setup: Subscriber

postgresql.conf

wal_level = logical
max_logical_replication_workers = 8
max_worker_processes = 16

Initial schema sync

Logical replication does not (yet) replicate the schema. You must create the table structure on the subscriber first. The cleanest pattern:

# From any host that can reach the publisher, dump schema only
pg_dump -h publisher.example.com -d appdb --schema-only \
    --no-publications --no-subscriptions \
    -f schema.sql

# On subscriber
psql -d appdb -f schema.sql

Create the subscription with PG17 failover support

CREATE SUBSCRIPTION sub_orders
  CONNECTION 'host=publisher.example.com port=5432 dbname=appdb user=repuser password=strong-password'
  PUBLICATION app_orders
  WITH (
    copy_data = true,        -- initial COPY of existing rows
    failover = true,         -- PG17 failover-slot magic
    streaming = parallel,    -- parallel apply for big transactions
    binary = true,           -- faster wire format
    origin = none            -- prevents loops in bidirectional setups
  );

Setting copy_data = true triggers a one-time bulk COPY of existing rows. For terabyte-scale tables this can take hours and saturate the network, so plan accordingly.
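
You can watch the initial COPY progress per table from the subscriber; pg_subscription_rel tracks each table's sync state:

```sql
-- On the subscriber: per-table state of the initial sync
-- i = initialize, d = copying data, f = copy finished,
-- s = synchronized, r = ready (normal streaming)
SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate;
```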

Monitoring: The Three Numbers That Matter

Logical replication has many possible failure modes; tracking these three metrics catches almost all of them.

1. Replication lag

-- On publisher
SELECT
  application_name,
  client_addr,
  state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) AS write_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;

A healthy subscriber shows lag in the megabytes during normal load and zero during quiet periods. Lag growing without bound means the subscriber is falling behind, usually because its apply workers are CPU-bound on a hot table, or because a long-running transaction is blocking the apply.

2. Slot disk usage

SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

If a subscriber is offline for hours and the slot is not advancing, the publisher retains all WAL since the last consumed LSN. This grows the pg_wal directory until disk fills. Monitor this aggressively; alert at 50 GB, page at 100 GB.
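
A belt-and-braces guardrail is max_slot_wal_keep_size, which caps how much WAL any slot may retain. Be aware that once a slot is invalidated by the cap, that subscriber must be re-initialized, so treat this as a last line of defense:

```sql
-- publisher postgresql.conf: hard cap on WAL retained per slot
-- max_slot_wal_keep_size = '100GB'

-- Check for slots at risk of, or already past, invalidation
SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots;

-- If a subscriber is decommissioned, drop its slot so WAL can be recycled
SELECT pg_drop_replication_slot('sub_orders');
```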

3. Subscription worker errors

-- On subscriber
SELECT * FROM pg_stat_subscription_stats;

Tracks apply errors, conflicts, and last error message. A non-zero apply_error_count is your signal that something on the subscriber is rejecting incoming changes (constraint violation, missing row, etc.).
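
If you would rather have a failing subscription park itself than crash-loop, the disable_on_error option (available since 15) does exactly that; fix the conflict, then re-enable:

```sql
-- On the subscriber: stop retrying after the first apply error
ALTER SUBSCRIPTION sub_orders SET (disable_on_error = true);

-- After fixing the conflicting row, resume
ALTER SUBSCRIPTION sub_orders ENABLE;
```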

Conflict Handling: The Hard Part

The single most common gotcha for teams new to logical replication is that conflicts are real and the subscriber simply stops applying when one occurs. The apply worker errors out and retries in a loop, replication lag grows, and you eventually get paged.

Common conflict scenarios:

  • UNIQUE violation โ€” a row with the same primary key exists on the subscriber
  • Missing row on UPDATE/DELETE โ€” the subscriber does not have the row the publisher is trying to update
  • Schema mismatch โ€” column type changed on subscriber but not publisher (or vice versa)
  • NULL constraint โ€” subscriber has NOT NULL constraint that publisher does not

Resolution patterns

For each conflict, you have three real options:

  1. Skip the bad transaction: ALTER SUBSCRIPTION sub_name SKIP (lsn = '0/12345'). Use the LSN from the error log. Quick fix, accepts data divergence.
  2. Fix the data and let it retry: identify the conflicting row on the subscriber, delete or update it manually, replication retries automatically.
  3. Re-initialize: drop the subscription, truncate the affected tables, re-create with copy_data = true. The "nuclear option": works but expensive.

The right strategy depends on your data model. For idempotent event streams, skipping is usually fine. For OLTP source-of-truth replication, fix-and-retry is correct. Re-initialize is a last resort.
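
The skip workflow, sketched with a hypothetical LSN; the real value comes from the subscriber's error log (look for the "finished at" LSN in the CONTEXT line):

```sql
-- On the subscriber; '0/14C0378' is an illustrative LSN from the log
ALTER SUBSCRIPTION sub_orders SKIP (lsn = '0/14C0378');

-- Afterwards, confirm the error counter has stopped climbing
SELECT subname, apply_error_count FROM pg_stat_subscription_stats;
```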

Use Case 1: Major Version Upgrade With Minimal Downtime

The classic logical replication use case. Today you run Postgres 15; you want to be on 17 with minutes of downtime, not hours.

  1. Build a new Postgres 17 cluster ("target"). Same data center, same network, same hardware class.
  2. Dump schema from 15, load on 17. Verify with diff against pg_dump --schema-only on both.
  3. Create publication on 15 (FOR ALL TABLES, since this is a full migration).
  4. Create subscription on 17 with copy_data = true. Wait for initial sync to complete (could be hours for big DBs).
  5. Monitor replication lag. When it is consistently <1 MB during peak load, you are ready.
  6. Brief maintenance window: stop writes to the 15 cluster (typically by removing the load balancer pool entry).
  7. Confirm replication has caught up to zero lag.
  8. Update application connection strings to point to the 17 cluster.
  9. Re-enable writes.
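
Step 7 can be made precise rather than eyeballed: capture the final LSN on the old primary once writes are stopped, then confirm the subscription's slot has confirmed past it (the slot name is assumed to match your subscription name):

```sql
-- On the 15 cluster, after writes are stopped: note the final LSN
SELECT pg_current_wal_lsn();

-- Still on the 15 cluster: the slot must confirm at least that LSN
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub_orders';
```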

Total downtime is usually 1-3 minutes, the time to flip connection strings. Compare that to pg_upgrade, which in copy mode means hours of downtime for a multi-terabyte database (link mode is faster but sacrifices easy rollback to the old cluster).

Use Case 2: Selective Replication for Analytics

You want a subset of your OLTP database (specific tables, recent rows only) replicated to an analytics environment. Logical replication with publications + row filters does this elegantly:

-- On OLTP primary
-- Row-filter expressions may only use immutable functions, so a moving
-- NOW() - INTERVAL window is not allowed; use a fixed cutoff and advance
-- it periodically if needed (the date below is illustrative)
CREATE PUBLICATION analytics
  FOR TABLE orders WHERE (created_at > '2026-01-01'),
      customers,
      products;

The analytics subscriber gets only orders newer than the cutoff plus all customers and products; schema-aligned with your analytics queries, no archive bloat.

The Operational Gotchas Nobody Warns You About

1. Sequences do not replicate

Logical replication ships row-level changes, not sequence advances. After a switchover, sequences on the subscriber are stale. Either reset them post-switchover or use UUIDs and avoid the problem.
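
One pattern for resetting them: generate setval() statements from the publisher's live sequence values and run the output on the subscriber (the +1000 safety margin is an arbitrary buffer, not a fixed rule):

```sql
-- On the publisher: generate setval() calls from current sequence values
SELECT format('SELECT setval(%L, %s);',
              schemaname || '.' || sequencename,
              last_value + 1000)          -- safety margin, tune to taste
FROM pg_sequences
WHERE last_value IS NOT NULL;
-- Feed the resulting statements to psql on the subscriber
```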

2. TOAST values can balloon WAL

Updates that touch TOAST-stored columns (large text, jsonb) force the full large value to be written to WAL and replicated, even when only a small part of that value changed. Watch WAL volume after enabling logical replication on a wide-row table.

3. Long-running transactions on the subscriber block apply

If a query on the subscriber holds a lock on a target table, the apply worker blocks until released. Reporting queries on subscribers can stall replication. Use statement_timeout on subscriber roles that do reads.
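
A minimal mitigation, assuming a dedicated read-only role for reporting (the role name is illustrative):

```sql
-- On the subscriber: bound how long reporting queries can hold locks
ALTER ROLE reporting_ro SET statement_timeout = '30s';
```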

4. ALTER TABLE ADD COLUMN is asymmetric

Adding a NOT NULL column without a default to a publisher table breaks replication immediately (subscriber rejects new rows that lack the column). Always add nullable, backfill, then constrain โ€” the standard expand-contract pattern.
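
The expand-contract sequence, sketched with a hypothetical column; each DDL step must run on both publisher and subscriber, since DDL does not replicate:

```sql
-- 1. Expand: add the column nullable (run on subscriber first, then publisher)
ALTER TABLE orders ADD COLUMN fulfillment_note text;

-- 2. Backfill on the publisher; the UPDATEs replicate normally
UPDATE orders SET fulfillment_note = '' WHERE fulfillment_note IS NULL;

-- 3. Contract: constrain only once both sides carry the data
ALTER TABLE orders ALTER COLUMN fulfillment_note SET NOT NULL;
```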

5. Initial sync can starve other workloads

The COPY phase of copy_data = true reads the whole table on the publisher. For multi-TB tables this is hours of sustained sequential I/O. Schedule initial sync for off-peak hours; consider partitioning very large publications.

Frequently Asked Questions

Can I do bidirectional replication?

Postgres 17 has the scaffolding (origin filtering, conflict detection plumbing) but no first-class active-active solution. For bidirectional setups today, look at pgEdge, BDR, or the open-source spock extension. Native bidirectional is rumored for 18.

Does logical replication replicate DDL?

Not in core Postgres. Use pglogical or hand-coded migration tooling that runs schema changes on both publisher and subscriber as part of your deploy process.

What about partitioned tables?

Yes. Since version 13 you can publish a partitioned table directly and, with publish_via_partition_root, replicate changes as if they came from the parent table; 17 handles both modes well. Versions before 13 required publishing each partition individually.

Is logical replication slow?

Per-row apply is slower than physical replication (which just replays WAL bytes). Modern Postgres with parallel apply (streaming = parallel) gets within 2-3x of physical for most workloads. Acceptable for most use cases; not appropriate for the highest-throughput OLTP systems.

How do I monitor lag in Prometheus?

postgres_exporter exposes metrics from pg_stat_replication and pg_replication_slots. Alert when a slot's retained WAL (the pg_wal_lsn_diff between pg_current_wal_lsn() and restart_lsn) exceeds a threshold (e.g., 1 GB) for a sustained period.

Real-World Example: A Cross-Region Migration

One e-commerce team we worked with used PG17 logical replication to migrate from a single-region Postgres 14 cluster (eu-west-1) to a multi-region active/passive setup with primaries in eu-west-1 and us-east-1. The migration took six weeks of preparation and exactly 90 seconds of customer-facing downtime.

The phases: build the new 17 cluster in parallel with the old one, replicate all tables for 10 days while validating row counts and checksums, run a synthetic application against the 17 cluster to surface any subtle behavior differences, then a brief maintenance window to flip the load balancer. Failover slots meant they could fail over from the 17 primary to its physical standby without re-initializing the cross-region logical subscriptions, a capability that simply did not exist in 16.


The Bottom Line

PostgreSQL 17 makes logical replication finally first-class. Failover slots remove the single biggest operational footgun, parallel apply makes performance acceptable for most workloads, and the row-filter and column-filter features make selective replication clean. If you have been waiting for "the right time" to move from physical-only HA to a logical-replication-aware architecture, the right time is the next major version cycle. Plan the migration carefully, monitor the three numbers that matter, and you have a tool that handles cross-version upgrades, selective replication, and read scaling all from the same primitive.

About the Author

Petr Novák, Senior PHP Developer, Backend Engineer, Technology Author