Recover out of sync PostgreSQL slave/replica (WAL has already been removed)

WAL or Write-Ahead Logging as a method that enables PostgreSQL data integrity. Mostly used in replication as it enables High Availability (HA). Now, PostgreSQL replication is a broad topic to cover, so I’ll stick with the issue I’ve faced not long ago on a production environment replication server.

master and main, slave, replica and standby are used interchangeably.

As I was checking the main server error logs I’ve come across the following error message: ERROR: requested WAL segment 00000003000000070000005D has already been removed.

Prerequisites

PostgreSQL Cluster

Solution

Basically, WAL files are written and sent to the standby (replica) servers in an asynchronous way. So, if you faced the same error as me, it turns out that some WAL files have been already removed from the main server before being sent to the replicas, hence you got inconsistent replica servers.

So now, you got to deal with recovering these replica servers.

Step 1. Stop the replica’s PostgreSQL service.

systemctl stop postgresql

Step 2. Just in case, create a backup from the PostgreSQL data directory. To find out the data dir path which by default is under /var/lib/postgresql/<postgresql_version>/main, you have few options, including:

Search for any data_directory line under /etc/postgresql/<postgresql_version>/main/postgresql.conf.

or, just run:

pg_lsclusters

or, even try running a SQL query:

SHOW data_directory;

Step 3. Next, backup the datadir. For instance:

cd /var/lib/postgresql/10
mkdir main-backup
cp -r main/* main-backup/

Step 4. Empty (YOLO) the datadir:

rm -rf /var/lib/postgresql/10/main*

Step 5. As for the “magic” restore command goes, run:

pg_basebackup -h 10.0.1.2 -U replica -Xs -P -R -D /var/lib/postgresql/10/main/

-h: master server IP address.
-U: the authentication PostgreSQL user that have permissions to execute the command.
-Xs: collect the WAL files as the backup is running in background.
-P: if you want to follow the progress.
-R: write the connection settings to a recovery.conf file.

Step 6. Cross your fingers and start the PostgreSQL service.

systemctl start postgresql

Step 7. Watch for any error logs usually saved under /var/log/postgresql/postgresql-<postgresql_version>-main.log.

dirty fix

You can always set wal_keep_segments found under postgresql.conf to some value greater than the default one. This will lead to increased number of WAL files as well as disk space.

do not try this at hime

Lastly, if you just want to mess around, haven’t tried it yet though, but if you do not care about downtime or anything at all:

Stop the master and replica instances.
Empty the replicas pg_wal directory → rm /var/lib/postgresql/<postgresql_version>/main/pg_wal/*.
Copy/sync the WAL files from the main server found under /var/lib/postgresql/<postgresql_version/main/pg_wal to the inconsistent replica servers. Use rsync, scp or some alternate SSH copy utility tools.

Conclusion

If you have another great solution to share, feel free to leave a comment below and if you find this tutorial useful, follow our official channel on Telegram.