Need generic step-by-step instructions on how to deploy a brand new cluster

Hi,

I feel something is missing from the generic instructions.
I tried to follow the setup explained in the link. Then I tried to use the generic stable/stolon chart with minimal alterations. In all scenarios I get a properly started set of pods, but I am unable to connect to the database, neither from outside nor from inside the keeper pod. I presume this is because the actual PostgreSQL server is not running at all, despite the chart starting up. It must be something trivial, but I am not able to see what is missing.
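
(For reference, a quick way to check whether PostgreSQL is actually running inside a keeper is to look at the process list and the data directory from outside the pod; a rough sketch, assuming the pod and namespace names used further below:)

    kubectl -n stolon exec pg-stolon-keeper-0 -- ps -ef | grep postgres
    kubectl -n stolon exec pg-stolon-keeper-0 -- ls -la /stolon-data/postgres

(If there is no postgres process and /stolon-data/postgres does not exist, the keeper never got through initdb, which matches the symptoms below.)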

The startup output below states that the cluster was initialized - is that not true? I am running it all with Helm 3.1.1 and Rancher on top of Kubernetes.
The chart is taken from charts/stable/stolon at master · helm/charts · GitHub.
To reduce the changes to a minimum, I only altered the passwords and this clusterSpec section:

clusterSpec:
  synchronousReplication: true
  minSynchronousStandbys: 1 # quorum-like replication
  maxSynchronousStandbys: 1 # quorum-like replication
  initMode: new
...

The rest of the chart was left untouched - but it still produces the exact same outcome. Please advise what the issue might be.
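
(Side note: as far as I understand, the clusterSpec from values.yaml is only applied by the create-cluster job at install time. If it needs changing afterwards, the running cluster spec can be patched with stolonctl; a rough sketch, assuming the chart uses the kubernetes store backend and that stolonctl is available inside the stolon image - the actual cluster name can be read from the pod environment, e.g. env | grep CLUSTER_NAME:)

    kubectl -n stolon exec -it pg-stolon-keeper-0 -- stolonctl \
      --cluster-name <cluster-name> --store-backend kubernetes --kube-resource-kind configmap \
      update --patch '{"synchronousReplication": true, "minSynchronousStandbys": 1, "maxSynchronousStandbys": 1}'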

 helm install pg --namespace stolon -f values.yaml ~/stolon/
NAME: pg
LAST DEPLOYED: Thu May 21 20:33:23 2020
NAMESPACE: stolon
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Stolon cluster installed and initialized.

To get superuser password run

    PGPASSWORD=$(kubectl get secret --namespace stolon stolon -o jsonpath="{.data.password}" | base64 --decode; echo)
[rancher@rancher stolon]$  PGPASSWORD=$(kubectl get secret --namespace stolon stolon -o jsonpath="{.data.password}" | base64 --decode; echo)

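(For reference, with the password in that shell variable, one way to reach the database from outside the cluster is to port-forward the proxy service; a sketch, assuming the default service name and whatever superuserUsername is set to in values.yaml - stolon by default, I believe:)

    kubectl -n stolon port-forward svc/pg-stolon-proxy 5432:5432 &
    PGPASSWORD="$PGPASSWORD" psql -h 127.0.0.1 -p 5432 -U stolon postgres
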
[rancher@rancher stolon]$ kubectl -n stolon get all
NAME                                     READY   STATUS      RESTARTS   AGE
pod/pg-stolon-create-cluster-vtftz       0/1     Completed   0          21s
pod/pg-stolon-keeper-0                   1/1     Running     0          21s
pod/pg-stolon-keeper-1                   1/1     Running     0          18s
pod/pg-stolon-keeper-2                   1/1     Running     0          16s
pod/pg-stolon-proxy-6c547c86b-rgx6s      1/1     Running     0          21s
pod/pg-stolon-proxy-6c547c86b-swc5b      1/1     Running     0          21s
pod/pg-stolon-proxy-6c547c86b-vbzx2      1/1     Running     0          21s
pod/pg-stolon-sentinel-d86997dcb-slk6v   1/1     Running     0          21s
pod/pg-stolon-sentinel-d86997dcb-v296g   1/1     Running     0          21s
pod/pg-stolon-sentinel-d86997dcb-zhd9n   1/1     Running     0          21s

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/pg-stolon-keeper-headless   ClusterIP   None            <none>        5432/TCP   21s
service/pg-stolon-proxy             ClusterIP   10.43.110.219   <none>        5432/TCP   21s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pg-stolon-proxy      3/3     3            3           21s
deployment.apps/pg-stolon-sentinel   3/3     3            3           21s

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/pg-stolon-proxy-6c547c86b      3         3         3       21s
replicaset.apps/pg-stolon-sentinel-d86997dcb   3         3         3       21s

NAME                                READY   AGE
statefulset.apps/pg-stolon-keeper   3/3     21s

NAME                                 COMPLETIONS   DURATION   AGE
job.batch/pg-stolon-create-cluster   1/1           1s         21s
[rancher@rancher stolon]$ kubectl -n stolon exec -it pg-stolon-keeper-0 -- psql --host 10.43.110.219 --port 5432 --username superuser_name -W
Password for user superuser_name:
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
command terminated with exit code 2

The keeper log says the same - 'no db assigned'. I hope it is something trivial - so, how do I assign that db?

2020-05-22T02:55:51.653Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2020-05-22T02:55:51.653Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2020-05-22T02:55:51.654Z INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-05-22T02:55:51.664Z INFO cmd/keeper.go:525 keeper uid {"uid": "keeper0"}
2020-05-22T02:55:51.685Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:55:56.688Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:01.692Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:06.696Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:11.700Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:16.710Z INFO cmd/keeper.go:1039 no db assigned
2020-05-22T02:56:21.717Z INFO cmd/keeper.go:1039 no db assigned
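
(As far as I understand, 'no db assigned' simply means the sentinel has not yet assigned a database to that keeper - there is nothing to assign by hand. What the sentinels currently think of the cluster can be inspected with stolonctl; a rough sketch, with the same store backend and cluster name assumptions as above:)

    kubectl -n stolon exec -it pg-stolon-keeper-0 -- stolonctl \
      --cluster-name <cluster-name> --store-backend kubernetes --kube-resource-kind configmap status
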
And actually, the other keeper seems to be less happy - I had not noticed that before. It is starting to get pretty frustrating.

Is there anybody here who was able to run this on persistent volumes? I am not using Azure - just the trivial 'hostpath' volume plugin from Rancher - how can that not work with Postgres? Here is the workload as Rancher shows it:

Workload: pg-stolon-keeper (Active)
Namespace: default
Image: sorintlab/stolon:v0.16.0-pg10
Workload Type: Stateful Set
Endpoints: n/a
Config Scale: 2
Ready Scale: 2
Created: 10:55 PM
Pod Restarts: 0

Running   pg-stolon-keeper-1   sorintlab/stolon:v0.16.0-pg10   10.42.2.74 / 192.168.10.252 / Created a few seconds ago / Restarts: 0
Running   pg-stolon-keeper-0   sorintlab/stolon:v0.16.0-pg10   10.42.1.142 / 192.168.10.251 / Created a few seconds ago / Restarts: 0
Logs from the stolon container in pg-stolon-keeper-1:
2020-05-22T02:55:54.581Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2020-05-22T02:55:54.581Z WARN cmd/keeper.go:182 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2020-05-22T02:55:54.582Z INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-05-22T02:55:54.590Z INFO cmd/keeper.go:525 keeper uid {"uid": "keeper1"}
2020-05-22T02:55:54.609Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:55:59.612Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:04.615Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:09.620Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:14.624Z INFO cmd/keeper.go:1033 our keeper data is not available, waiting for it to appear
2020-05-22T02:56:19.629Z INFO cmd/keeper.go:1094 current db UID different than cluster data db UID {"db": "", "cdDB": "9ba49f5e"}
2020-05-22T02:56:19.629Z INFO cmd/keeper.go:1101 initializing the database cluster
The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
2020-05-22 02:56:20.078 UTC [48] LOG: could not link file "pg_wal/xlogtemp.48" to "pg_wal/000000010000000000000001": Operation not permitted
2020-05-22 02:56:20.079 UTC [48] FATAL: could not open file "pg_wal/000000010000000000000001": No such file or directory
child process exited with exit code 1
initdb: removing data directory "/stolon-data/postgres"
2020-05-22T02:56:20.105Z ERROR cmd/keeper.go:1135 failed to initialize postgres database cluster {"error": "error: exit status 1"}
2020-05-22T02:56:25.109Z ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-05-22T02:56:25.117Z INFO cmd/keeper.go:1094 current db UID different than cluster data db UID {"db": "", "cdDB": "9ba49f5e"}
2020-05-22T02:56:25.117Z INFO cmd/keeper.go:1101 initializing the database cluster
running bootstrap script ... The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
2020-05-22 02:56:25.645 UTC [61] LOG: could not link file "pg_wal/xlogtemp.61" to "pg_wal/000000010000000000000001": Operation not permitted
2020-05-22 02:56:25.647 UTC [61] FATAL: could not open file "pg_wal/000000010000000000000001": No such file or directory
child process exited with exit code 1
initdb: removing data directory "/stolon-data/postgres"
2020-05-22T02:56:25.676Z ERROR cmd/keeper.go:1135 failed to initialize postgres database cluster {"error": "error: exit status 1"}
2020-05-22T02:56:30.681Z ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-05-22T02:56:30.686Z INFO cmd/keeper.go:1094 current db UID different than cluster data db UID {"db": "", "cdDB": "9ba49f5e"}
2020-05-22T02:56:30.686Z INFO cmd/keeper.go:1101 initializing the database cluster
running bootstrap script ... The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok

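(The repeated 'could not link file ... Operation not permitted' from initdb points at the underlying volume rather than at stolon or Postgres: initdb creates the first WAL segment with a hard link, and some hostPath-backed filesystems refuse hard links. A quick test on the actual mount; a rough sketch, assuming the pod names from the second attempt in the default namespace:)

    kubectl -n default exec -it pg-stolon-keeper-1 -- \
      sh -c 'cd /stolon-data && touch linktest && ln linktest linktest2 && echo "hard links OK"; rm -f linktest linktest2'

(If the ln step fails with 'Operation not permitted', the volume plugin or its filesystem is the culprit, not Postgres.)
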
So, after dropping all the connected storage and making the volumes locally in the filesystem - which is absolutely useless for anything real, darn - it started and cranked up. I can see the processes now running in the keepers:
root@pg-stolon-keeper-1:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
stolon 1 0 0 03:23 ? 00:00:02 stolon-keeper --data-dir /stolon-data
stolon 78 1 0 03:23 ? 00:00:00 postgres -D /stolon-data/postgres -c unix_socket_directories=/tmp
stolon 80 78 0 03:23 ? 00:00:00 postgres: checkpointer process
stolon 81 78 0 03:23 ? 00:00:00 postgres: writer process
stolon 82 78 0 03:23 ? 00:00:00 postgres: wal writer process
stolon 83 78 0 03:23 ? 00:00:00 postgres: autovacuum launcher process
stolon 84 78 0 03:23 ? 00:00:00 postgres: stats collector process
stolon 85 78 0 03:23 ? 00:00:00 postgres: bgworker: logical replication launcher
stolon 126 78 0 03:23 ? 00:00:00 postgres: wal sender process repluser 10.42.2.76(57820) streaming 0/3000140
root 1139 0 0 03:29 ? 00:00:00 /bin/sh -c TERM=xterm-256color; export TERM; [ -x /bin/bash ] && ([ -x /usr/bin/script ] && /usr/bin/script -q -c "/bin/bash" /dev/null || exec /bin/bash) || exec /bin/sh
root 1146 1139 0 03:29 ? 00:00:00 /bin/sh -c TERM=xterm-256color; export TERM; [ -x /bin/bash ] && ([ -x /usr/bin/script ] && /usr/bin/script -q -c "/bin/bash" /dev/null || exec /bin/bash) || exec /bin/sh
root 1147 1146 0 03:29 ? 00:00:00 /usr/bin/script -q -c /bin/bash /dev/null
root 1148 1147 0 03:29 pts/0 00:00:00 sh -c /bin/bash
root 1149 1148 0 03:29 pts/0 00:00:00 /bin/bash
root 1727 0 0 03:31 ? 00:00:00 /bin/sh -c TERM=xterm-256color; export TERM; [ -x /bin/bash ] && ([ -x /usr/bin/script ] && /usr/bin/script -q -c "/bin/bash" /dev/null || exec /bin/bash) || exec /bin/sh
root 1734 1727 0 03:31 ? 00:00:00 /bin/sh -c TERM=xterm-256color; export TERM; [ -x /bin/bash ] && ([ -x /usr/bin/script ] && /usr/bin/script -q -c "/bin/bash" /dev/null || exec /bin/bash) || exec /bin/sh
root 1735 1734 0 03:31 ? 00:00:00 /usr/bin/script -q -c /bin/bash /dev/null
root 1736 1735 0 03:31 pts/1 00:00:00 sh -c /bin/bash
root 1737 1736 0 03:31 pts/1 00:00:00 /bin/bash
root 1787 1737 0 03:32 pts/1 00:00:00 ps -ef

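(Since the ps output shows a wal sender streaming to a standby, replication itself can be sanity-checked from the master over the local socket; a rough sketch - the socket directory comes from the unix_socket_directories setting above, and it may still prompt for the superuser password:)

    kubectl -n default exec -it pg-stolon-keeper-1 -- \
      psql -h /tmp -U stolon -d postgres -c 'SELECT client_addr, state, sync_state FROM pg_stat_replication;'
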
So that is some progress. It has been a truly frustrating experience to deal with all of this so far. The most trivial things seem to be breaking in the most unexpected places. :frowning:

[rancher@rancher ~]$ kubectl -n default exec -it pg-stolon-keeper-0 -- psql --host 10.43.70.50 --port 5432 --username stolon postgres -W
Password for user stolon:
psql (10.12 (Debian 10.12-1.pgdg90+1))
Type "help" for help.

postgres=#
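
(And to check whether the synchronousReplication setting from values.yaml actually took effect, the synchronous standby list can be inspected from this same session - as far as I understand, stolon sets it on the master when synchronous replication is enabled:)

    postgres=# SHOW synchronous_standby_names;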