Password authentication fails for repl_user

MoskitoHero · September 9, 2020, 4:14am

Hello,
I am trying to set up Stolon on a GCE Kubernetes cluster. I have followed the instructions explained here: https://disaev.me/p/how-i-have-set-up-ha-postgresql-on-kubernetes/
This is the config I am using right now:

USER-SUPPLIED VALUES:
clusterSpec:
  initMode: new
  maxSynchronousStandbys: 1
  minSynchronousStandbys: 1
  synchronousReplication: true
debug: false
etcdImage:
  pullPolicy: IfNotPresent
  repository: k8s.gcr.io/etcd-amd64
  tag: 2.3.7
image:
  pullPolicy: IfNotPresent
  repository: sorintlab/stolon
  tag: v0.16.0-pg12
job:
  annotations: {}
  autoCreateCluster: true
  autoUpdateClusterSpec: true
keeper:
  affinity: {}
  annotations: {}
  extraEnv: []
  fsGroup: ""
  hooks:
failKeeper:
  enabled: false
  nodeSelector: {}
  podDisruptionBudget: null
  priorityClassName: ""
  replicaCount: 3
  resources: {}
  service:
annotations: {}
ports:
  keeper:
    port: 5432
    protocol: TCP
    targetPort: 5432
type: ClusterIP
  tolerations: []
  uid_prefix: keeper
  volumeMounts: []
  volumes: []
persistence:
  accessModes:
  - ReadWriteOnce
  enabled: true
  size: 200Gi
  storageClassName: "standard"
pgParameters:
  max_connections: "100"
  shared_buffers: 1536MB
  effective_cache_size: 4608MB
  maintenance_work_mem: 768MB
  checkpoint_completion_target: "0.9"
  wal_buffers: 16MB
  default_statistics_target: "500"
  random_page_cost: "1.1"
  effective_io_concurrency: "200"
  work_mem: 7864kB
  min_wal_size: 4GB
  max_wal_size: 16GB
  max_worker_processes: "2"
  max_parallel_workers_per_gather: "1"
  max_parallel_workers: "2"
  max_parallel_maintenance_workers: "1"
  log_min_duration_statement: "100"
ports:
  metrics:
containerPort: 8080
  stolon:
containerPort: 5432
proxy:
  affinity: {}
  annotations: {}
  extraEnv: []
  nodeSelector: {}
  podDisruptionBudget: null
  priorityClassName: ""
  replicaCount: 3
  resources: {}
  service:
annotations: {}
ports:
  proxy:
    port: 5432
    protocol: TCP
    targetPort: 5432
type: ClusterIP
  tolerations: []
rbac:
  create: true
replicationPassword: null
replicationPasswordFile: null
replicationSecret:
  name: pg-repl
  passwordKey: password
  usernameKey: username
replicationUsername: repl_username
sentinel:
  affinity: {}
  annotations: {}
  extraEnv: []
  nodeSelector: {}
  podDisruptionBudget: null
  priorityClassName: ""
  replicaCount: 3
  resources: {}
  tolerations: []
serviceAccount:
  create: true
  name: null
serviceMonitor:
  enabled: false
  labels: {}
shmVolume:
  enabled: false
store:
  backend: kubernetes
  kubeResourceKind: configmap
superuserPassword: null
superuserPasswordFile: null
superuserSecret:
  name: pg-su
  passwordKey: password
  usernameKey: username
superuserUsername: postgres
tls:
  enabled: false
  rootCa: ""
  serverCrt: ""
  serverKey: ""

I have set the secrets like so:

kubectl create secret generic pg-su --namespace default --from-literal=username='postgres' --from-literal=password='MYPASSWORDHERE'
kubectl create secret generic pg-repl  --namespace default  --from-literal=username='repl_username' --from-literal=password='MYPASSWORDTHERE'

The main problem I am facing is that replication just doesn’t seem to happen at all.

This is the output given by my keeper-0 pod:

2020-09-09T03:50:04.605Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-pg-repl/password", "mode": "01000000777"}
2020-09-09T03:50:04.606Z	WARN	cmd/keeper.go:182	password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...	{"file": "/etc/secrets/stolon-pg-su/password", "mode": "01000000777"}
2020-09-09T03:50:04.607Z	INFO	cmd/keeper.go:2039	exclusive lock on data dir taken
2020-09-09T03:50:04.610Z	INFO	cmd/keeper.go:525	keeper uid	{"uid": "keeper0"}
2020-09-09T03:50:04.644Z	INFO	cmd/keeper.go:1033	our keeper data is not available, waiting for it to appear
2020-09-09T03:50:09.652Z	INFO	cmd/keeper.go:1033	our keeper data is not available, waiting for it to appear
2020-09-09T03:50:14.659Z	INFO	cmd/keeper.go:1033	our keeper data is not available, waiting for it to appear
2020-09-09T03:50:19.668Z	INFO	cmd/keeper.go:1033	our keeper data is not available, waiting for it to appear
2020-09-09T03:50:24.678Z	ERROR	cmd/keeper.go:1063	db failed to initialize or resync
2020-09-09T03:50:24.683Z	INFO	cmd/keeper.go:1094	current db UID different than cluster data db UID	{"db": "", "cdDB": "fdc90e49"}
2020-09-09T03:50:24.683Z	INFO	cmd/keeper.go:1101	initializing the database cluster
The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok


Success. You can now start the database server using:

    pg_ctl -D /stolon-data/postgres -l logfile start

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
2020-09-09T03:50:25.739Z	INFO	postgresql/postgresql.go:319	starting database
2020-09-09 03:50:25.755 UTC [45] LOG:  starting PostgreSQL 12.2 (Debian 12.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-09-09 03:50:25.755 UTC [45] LOG:  listening on IPv4 address "10.28.2.51", port 5432
2020-09-09 03:50:25.762 UTC [45] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-09 03:50:25.817 UTC [46] LOG:  database system was shut down at 2020-09-09 03:50:25 UTC
2020-09-09 03:50:25.827 UTC [45] LOG:  database system is ready to accept connections
2020-09-09T03:50:25.962Z	INFO	cmd/keeper.go:1160	setting roles
2020-09-09T03:50:25.963Z	INFO	postgresql/postgresql.go:571	setting superuser password
2020-09-09T03:50:25.968Z	INFO	postgresql/postgresql.go:575	superuser password set
2020-09-09T03:50:25.969Z	INFO	postgresql/postgresql.go:578	creating replication role
2020-09-09T03:50:25.977Z	INFO	postgresql/postgresql.go:588	replication role created	{"role": "repl_username"}
2020-09-09T03:50:25.981Z	INFO	postgresql/postgresql.go:384	stopping database
2020-09-09 03:50:25.982 UTC [45] LOG:  received fast shutdown request
2020-09-09 03:50:25.986 UTC [45] LOG:  aborting any active transactions
2020-09-09 03:50:25.990 UTC [45] LOG:  background worker "logical replication launcher" (PID 52) exited with exit code 1
2020-09-09 03:50:25.990 UTC [47] LOG:  shutting down
2020-09-09 03:50:26.040 UTC [45] LOG:  database system is shut down
waiting for server to shut down.... done
server stopped
2020-09-09T03:50:26.110Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:26.130Z	INFO	postgresql/postgresql.go:319	starting database
2020-09-09 03:50:26.146 UTC [66] LOG:  starting PostgreSQL 12.2 (Debian 12.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-09-09 03:50:26.146 UTC [66] LOG:  listening on IPv4 address "10.28.2.51", port 5432
2020-09-09 03:50:26.152 UTC [66] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-09 03:50:26.205 UTC [67] LOG:  database system was shut down at 2020-09-09 03:50:26 UTC
2020-09-09 03:50:26.212 UTC [66] LOG:  database system is ready to accept connections
2020-09-09T03:50:26.338Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:26.356Z	INFO	cmd/keeper.go:1626	postgres parameters not changed
2020-09-09T03:50:26.357Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed
2020-09-09T03:50:31.436Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:31.438Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:31.465Z	INFO	cmd/keeper.go:1626	postgres parameters not changed
2020-09-09T03:50:31.465Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed
2020-09-09T03:50:36.528Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:36.530Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:36.555Z	INFO	cmd/keeper.go:1626	postgres parameters not changed
2020-09-09T03:50:36.555Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed
2020-09-09T03:50:41.636Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:41.639Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:41.656Z	INFO	cmd/keeper.go:1626	postgres parameters not changed
2020-09-09T03:50:41.657Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed

So everything seems to start fine. However, right after that:

2020-09-09 03:50:43.737 UTC [127] FATAL:  password authentication failed for user "repl_username"
2020-09-09 03:50:43.737 UTC [127] DETAIL:  Password does not match for user "repl_username".
	Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09 03:50:44.692 UTC [128] FATAL:  password authentication failed for user "repl_username"
2020-09-09 03:50:44.692 UTC [128] DETAIL:  Password does not match for user "repl_username".
	Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09T03:50:46.714Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:46.716Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:46.721Z	INFO	cmd/keeper.go:961	creating replication slot	{"slot": "stolon_53e340c1"}
2020-09-09T03:50:46.735Z	INFO	cmd/keeper.go:961	creating replication slot	{"slot": "stolon_fa61f70f"}
2020-09-09T03:50:46.760Z	INFO	cmd/keeper.go:1609	needed synchronous_standby_names changed	{"prevSyncStandbyNames": "", "syncStandbyNames": "stolonfakestandby"}
2020-09-09T03:50:46.761Z	INFO	cmd/keeper.go:1621	postgres parameters changed, reloading postgres instance
2020-09-09T03:50:46.761Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed
2020-09-09T03:50:46.761Z	INFO	postgresql/postgresql.go:421	reloading database configuration
2020-09-09 03:50:46.780 UTC [66] LOG:  received SIGHUP, reloading configuration files
2020-09-09 03:50:46.781 UTC [66] LOG:  parameter "synchronous_standby_names" changed to "stolonfakestandby"
server signaled
2020-09-09 03:50:48.842 UTC [149] FATAL:  password authentication failed for user "repl_username"
2020-09-09 03:50:48.842 UTC [149] DETAIL:  Password does not match for user "repl_username".
	Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09 03:50:49.790 UTC [150] FATAL:  password authentication failed for user "repl_username"
2020-09-09 03:50:49.790 UTC [150] DETAIL:  Password does not match for user "repl_username".
	Connection matched pg_hba.conf line 5: "host replication repl_username 0.0.0.0/0 md5"
2020-09-09T03:50:51.845Z	INFO	cmd/keeper.go:1457	our db requested role is master
2020-09-09T03:50:51.847Z	INFO	cmd/keeper.go:1493	already master
2020-09-09T03:50:51.866Z	INFO	cmd/keeper.go:1626	postgres parameters not changed
2020-09-09T03:50:51.866Z	INFO	cmd/keeper.go:1653	postgres hba entries not changed

And the final 8-line loop goes on like this forever. Connecting to the db works through the proxy service, but any command is kept on hold, waiting for replication indefinitely, until I hit Ctrl-C:

kubectl exec -it stolon-postgres-keeper-0 -- psql -h 10.94.4.60 -p 5432 -U postgres -W -d postgres
Password: 
psql (12.2 (Debian 12.2-1.pgdg100+1))
Type "help" for help.

postgres=# CREATE EXTENSION UNACCENT;
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
CREATE EXTENSION
postgres=# \q

What have I missed? Can you help me?

MoskitoHero · September 9, 2020, 8:15am

OK, so I think I have found the culprit: a forbidden character was hiding in the repl_user random password I had generated on an website.

So there you go, make sure lesson learned today: make sure my passwords comply with the constraints !