Hello,
I am trying to set up Stolon on a GCE Kubernetes cluster. I have followed the instructions explained here: https://disaev.me/p/how-i-have-set-up-ha-postgresql-on-kubernetes/
This is the config I am using right now:
USER-SUPPLIED VALUES:
clusterSpec:
initMode: new
maxSynchronousStandbys: 1
minSynchronousStandbys: 1
synchronousReplication: true
debug: false
etcdImage:
pullPolicy: IfNotPresent
repository: k8s.gcr.io/etcd-amd64
tag: 2.3.7
image:
pullPolicy: IfNotPresent
repository: sorintlab/stolon
tag: v0.16.0-pg12
job:
annotations: {}
autoCreateCluster: true
autoUpdateClusterSpec: true
keeper:
affinity: {}
annotations: {}
extraEnv: []
fsGroup: ""
hooks:
failKeeper:
enabled: false
nodeSelector: {}
podDisruptionBudget: null
priorityClassName: ""
replicaCount: 3
resources: {}
service:
annotations: {}
ports:
keeper:
port: 5432
protocol: TCP
targetPort: 5432
type: ClusterIP
tolerations: []
uid_prefix: keeper
volumeMounts: []
volumes: []
persistence:
accessModes:
- ReadWriteOnce
enabled: true
size: 200Gi
storageClassName: "standard"
pgParameters:
max_connections: "100"
shared_buffers: 1536MB
effective_cache_size: 4608MB
maintenance_work_mem: 768MB
checkpoint_completion_target: "0.9"
wal_buffers: 16MB
default_statistics_target: "500"
random_page_cost: "1.1"
effective_io_concurrency: "200"
work_mem: 7864kB
min_wal_size: 4GB
max_wal_size: 16GB
max_worker_processes: "2"
max_parallel_workers_per_gather: "1"
max_parallel_workers: "2"
max_parallel_maintenance_workers: "1"
log_min_duration_statement: "100"
ports:
metrics:
containerPort: 8080
stolon:
containerPort: 5432
proxy:
affinity: {}
annotations: {}
extraEnv: []
nodeSelector: {}
podDisruptionBudget: null
priorityClassName: ""
replicaCount: 3
resources: {}
service:
annotations: {}
ports:
proxy:
port: 5432
protocol: TCP
targetPort: 5432
type: ClusterIP
tolerations: []
rbac:
create: true
replicationPassword: null
replicationPasswordFile: null
replicationSecret:
name: pg-repl
passwordKey: password
usernameKey: username
replicationUsername: repl_username
sentinel:
affinity: {}
annotations: {}
extraEnv: []
nodeSelector: {}
podDisruptionBudget: null
priorityClassName: ""
replicaCount: 3
resources: {}
tolerations: []
serviceAccount:
create: true
name: null
serviceMonitor:
enabled: false
labels: {}
shmVolume:
enabled: false
store:
backend: kubernetes
kubeResourceKind: configmap
superuserPassword: null
superuserPasswordFile: null
superuserSecret:
name: pg-su
passwordKey: password
usernameKey: username
superuserUsername: postgres
tls:
enabled: false
rootCa: ""
serverCrt: ""
serverKey: ""
I have set the secrets like so:
kubectl create secret generic pg-su --namespace default --from-literal=username='postgres' --from-literal=password='MYPASSWORDHERE'
kubectl create secret generic pg-repl --namespace default --from-literal=username='repl_username' --from-literal=password='MYPASSWORDTHERE'
The main problem I am facing is that replication just doesn’t seem to happen at all.
This is the output given by my keeper-0 pod:
2020-09-09T03:50:04.605Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-pg-repl/password", "mode": "01000000777"}
2020-09-09T03:50:04.606Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon-pg-su/password", "mode": "01000000777"}
2020-09-09T03:50:04.607Z INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-09-09T03:50:04.610Z INFO cmd/keeper.go:525 keeper uid {"uid": "keeper0"}
2020-09-09T03:50:04.644Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-09-09T03:50:09.652Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-09-09T03:50:14.659Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-09-09T03:50:19.668Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-09-09T03:50:24.678Z ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-09-09T03:50:24.683Z INFO cmd/keeper.go:1094 current db UID different than cluster data db UID {"db": "", "cdDB": "fdc90e49"}
2020-09-09T03:50:24.683Z INFO cmd/keeper.go:1101 initializing the database cluster
The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /stolon-data/postgres -l logfile start
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
2020-09-09T03:50:25.739Z INFO postgresql/postgresql.go:319 starting database
2020-09-09 03:50:25.755 UTC [45] LOG: starting PostgreSQL 12.2 (Debian 12.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-09-09 03:50:25.755 UTC [45] LOG: listening on IPv4 address "10.28.2.51", port 5432
2020-09-09 03:50:25.762 UTC [45] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-09 03:50:25.817 UTC [46] LOG: database system was shut down at 2020-09-09 03:50:25 UTC
2020-09-09 03:50:25.827 UTC [45] LOG: database system is ready to accept connections
2020-09-09T03:50:25.962Z INFO cmd/keeper.go:1160 setting roles
2020-09-09T03:50:25.963Z INFO postgresql/postgresql.go:571 setting superuser password
2020-09-09T03:50:25.968Z INFO postgresql/postgresql.go:575 superuser password set
2020-09-09T03:50:25.969Z INFO postgresql/postgresql.go:578 creating replication role
2020-09-09T03:50:25.977Z INFO postgresql/postgresql.go:588 replication role created {"role": "repl_username"}
2020-09-09T03:50:25.981Z INFO postgresql/postgresql.go:384 stopping database
2020-09-09 03:50:25.982 UTC [45] LOG: received fast shutdown request
2020-09-09 03:50:25.986 UTC [45] LOG: aborting any active transactions
2020-09-09 03:50:25.990 UTC [45] LOG: background worker "logical replication launcher" (PID 52) exited with exit code 1
2020-09-09 03:50:25.990 UTC [47] LOG: shutting down
2020-09-09 03:50:26.040 UTC [45] LOG: database system is shut down
waiting for server to shut down.... done
server stopped
2020-09-09T03:50:26.110Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:26.130Z INFO postgresql/postgresql.go:319 starting database
2020-09-09 03:50:26.146 UTC [66] LOG: starting PostgreSQL 12.2 (Debian 12.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-09-09 03:50:26.146 UTC [66] LOG: listening on IPv4 address "10.28.2.51", port 5432
2020-09-09 03:50:26.152 UTC [66] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-09 03:50:26.205 UTC [67] LOG: database system was shut down at 2020-09-09 03:50:26 UTC
2020-09-09 03:50:26.212 UTC [66] LOG: database system is ready to accept connections
2020-09-09T03:50:26.338Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:26.356Z INFO cmd/keeper.go:1626 postgres parameters not changed
2020-09-09T03:50:26.357Z INFO cmd/keeper.go:1653 postgres hba entries not changed
2020-09-09T03:50:31.436Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:31.438Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:31.465Z INFO cmd/keeper.go:1626 postgres parameters not changed
2020-09-09T03:50:31.465Z INFO cmd/keeper.go:1653 postgres hba entries not changed
2020-09-09T03:50:36.528Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:36.530Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:36.555Z INFO cmd/keeper.go:1626 postgres parameters not changed
2020-09-09T03:50:36.555Z INFO cmd/keeper.go:1653 postgres hba entries not changed
2020-09-09T03:50:41.636Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:41.639Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:41.656Z INFO cmd/keeper.go:1626 postgres parameters not changed
2020-09-09T03:50:41.657Z INFO cmd/keeper.go:1653 postgres hba entries not changed
So everything seems to start fine. However, right after that:
2020-09-09 03:50:43.737 UTC [127] FATAL: password authentication failed for user "repl_username"
2020-09-09 03:50:43.737 UTC [127] DETAIL: Password does not match for user "repl_username".
Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09 03:50:44.692 UTC [128] FATAL: password authentication failed for user "repl_username"
2020-09-09 03:50:44.692 UTC [128] DETAIL: Password does not match for user "repl_username".
Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09T03:50:46.714Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:46.716Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:46.721Z INFO cmd/keeper.go:961 creating replication slot {"slot": "stolon_53e340c1"}
2020-09-09T03:50:46.735Z INFO cmd/keeper.go:961 creating replication slot {"slot": "stolon_fa61f70f"}
2020-09-09T03:50:46.760Z INFO cmd/keeper.go:1609 needed synchronous_standby_names changed {"prevSyncStandbyNames": "", "syncStandbyNames": "stolonfakestandby"}
2020-09-09T03:50:46.761Z INFO cmd/keeper.go:1621 postgres parameters changed, reloading postgres instance
2020-09-09T03:50:46.761Z INFO cmd/keeper.go:1653 postgres hba entries not changed
2020-09-09T03:50:46.761Z INFO postgresql/postgresql.go:421 reloading database configuration
2020-09-09 03:50:46.780 UTC [66] LOG: received SIGHUP, reloading configuration files
2020-09-09 03:50:46.781 UTC [66] LOG: parameter "synchronous_standby_names" changed to "stolonfakestandby"
server signaled
2020-09-09 03:50:48.842 UTC [149] FATAL: password authentication failed for user "repl_username"
2020-09-09 03:50:48.842 UTC [149] DETAIL: Password does not match for user "repl_username".
Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09 03:50:49.790 UTC [150] FATAL: password authentication failed for user "repl_username"
2020-09-09 03:50:49.790 UTC [150] DETAIL: Password does not match for user "repl_username".
Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09T03:50:51.845Z INFO cmd/keeper.go:1457 our db requested role is master
2020-09-09T03:50:51.847Z INFO cmd/keeper.go:1493 already master
2020-09-09T03:50:51.866Z INFO cmd/keeper.go:1626 postgres parameters not changed
2020-09-09T03:50:51.866Z INFO cmd/keeper.go:1653 postgres hba entries not changed
And the final 8-line loop goes on like this forever. Connecting to the db works through the proxy service, but any command is kept on hold, waiting for replication indefinitely, until I hit Ctrl-C:
kubectl exec -it stolon-postgres-keeper-0 -- psql -h 10.94.4.60 -p 5432 -U postgres -W -d postgres
Password:
psql (12.2 (Debian 12.2-1.pgdg100+1))
Type "help" for help.
postgres=# CREATE EXTENSION UNACCENT;
^CCancel request sent
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
CREATE EXTENSION
postgres=# \q
What have I missed? Can you help me?